linux_dsm_epyc7002/fs/nfs/dir.c

2691 lines
68 KiB
C
Raw Normal View History

/*
* linux/fs/nfs/dir.c
*
* Copyright (C) 1992 Rick Sladkey
*
* nfs directory handling functions
*
* 10 Apr 1996 Added silly rename for unlink --okir
* 28 Sep 1996 Improved directory cache --okir
* 23 Aug 1997 Claus Heine claus@momo.math.rwth-aachen.de
* Re-implemented silly rename for unlink, newly implemented
* silly rename for nfs_rename() following the suggestions
* of Olaf Kirch (okir) found in this file.
* Following Linus comments on my original hack, this version
* depends only on the dcache stuff and doesn't touch the inode
* layer (iput() and friends).
* 6 Jun 1999 Cache readdir lookups in the page cache. -DaveM
*/
#include <linux/module.h>
#include <linux/time.h>
#include <linux/errno.h>
#include <linux/stat.h>
#include <linux/fcntl.h>
#include <linux/string.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/namei.h>
NFS: Share NFS superblocks per-protocol per-server per-FSID The attached patch makes NFS share superblocks between mounts from the same server and FSID over the same protocol. It does this by creating each superblock with a false root and returning the real root dentry in the vfsmount presented by get_sb(). The root dentry set starts off as an anonymous dentry if we don't already have the dentry for its inode, otherwise it simply returns the dentry we already have. We may thus end up with several trees of dentries in the superblock, and if at some later point one of anonymous tree roots is discovered by normal filesystem activity to be located in another tree within the superblock, the anonymous root is named and materialises attached to the second tree at the appropriate point. Why do it this way? Why not pass an extra argument to the mount() syscall to indicate the subpath and then pathwalk from the server root to the desired directory? You can't guarantee this will work for two reasons: (1) The root and intervening nodes may not be accessible to the client. With NFS2 and NFS3, for instance, mountd is called on the server to get the filehandle for the tip of a path. mountd won't give us handles for anything we don't have permission to access, and so we can't set up NFS inodes for such nodes, and so can't easily set up dentries (we'd have to have ghost inodes or something). With this patch we don't actually create dentries until we get handles from the server that we can use to set up their inodes, and we don't actually bind them into the tree until we know for sure where they go. (2) Inaccessible symbolic links. If we're asked to mount two exports from the server, eg: mount warthog:/warthog/aaa/xxx /mmm mount warthog:/warthog/bbb/yyy /nnn We may not be able to access anything nearer the root than xxx and yyy, but we may find out later that /mmm/www/yyy, say, is actually the same directory as the one mounted on /nnn. What we might then find out, for example, is that /warthog/bbb was actually a symbolic link to /warthog/aaa/xxx/www, but we can't actually determine that by talking to the server until /warthog is made available by NFS. This would lead to having constructed an errneous dentry tree which we can't easily fix. We can end up with a dentry marked as a directory when it should actually be a symlink, or we could end up with an apparently hardlinked directory. With this patch we need not make assumptions about the type of a dentry for which we can't retrieve information, nor need we assume we know its place in the grand scheme of things until we actually see that place. This patch reduces the possibility of aliasing in the inode and page caches for inodes that may be accessed by more than one NFS export. It also reduces the number of superblocks required for NFS where there are many NFS exports being used from a server (home directory server + autofs for example). This in turn makes it simpler to do local caching of network filesystems, as it can then be guaranteed that there won't be links from multiple inodes in separate superblocks to the same cache file. Obviously, cache aliasing between different levels of NFS protocol could still be a problem, but at least that gives us another key to use when indexing the cache. This patch makes the following changes: (1) The server record construction/destruction has been abstracted out into its own set of functions to make things easier to get right. These have been moved into fs/nfs/client.c. All the code in fs/nfs/client.c has to do with the management of connections to servers, and doesn't touch superblocks in any way; the remaining code in fs/nfs/super.c has to do with VFS superblock management. (2) The sequence of events undertaken by NFS mount is now reordered: (a) A volume representation (struct nfs_server) is allocated. (b) A server representation (struct nfs_client) is acquired. This may be allocated or shared, and is keyed on server address, port and NFS version. (c) If allocated, the client representation is initialised. The state member variable of nfs_client is used to prevent a race during initialisation from two mounts. (d) For NFS4 a simple pathwalk is performed, walking from FH to FH to find the root filehandle for the mount (fs/nfs/getroot.c). For NFS2/3 we are given the root FH in advance. (e) The volume FSID is probed for on the root FH. (f) The volume representation is initialised from the FSINFO record retrieved on the root FH. (g) sget() is called to acquire a superblock. This may be allocated or shared, keyed on client pointer and FSID. (h) If allocated, the superblock is initialised. (i) If the superblock is shared, then the new nfs_server record is discarded. (j) The root dentry for this mount is looked up from the root FH. (k) The root dentry for this mount is assigned to the vfsmount. (3) nfs_readdir_lookup() creates dentries for each of the entries readdir() returns; this function now attaches disconnected trees from alternate roots that happen to be discovered attached to a directory being read (in the same way nfs_lookup() is made to do for lookup ops). The new d_materialise_unique() function is now used to do this, thus permitting the whole thing to be done under one set of locks, and thus avoiding any race between mount and lookup operations on the same directory. (4) The client management code uses a new debug facility: NFSDBG_CLIENT which is set by echoing 1024 to /proc/net/sunrpc/nfs_debug. (5) Clone mounts are now called xdev mounts. (6) Use the dentry passed to the statfs() op as the handle for retrieving fs statistics rather than the root dentry of the superblock (which is now a dummy). Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-08-23 07:06:13 +07:00
#include <linux/mount.h>
#include <linux/swap.h>
#include <linux/sched.h>
#include <linux/kmemleak.h>
#include <linux/xattr.h>
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
#include "fscache.h"
#include "nfstrace.h"
/* #define NFS_DEBUG_VERBOSE 1 */
static int nfs_opendir(struct inode *, struct file *);
static int nfs_closedir(struct inode *, struct file *);
static int nfs_readdir(struct file *, struct dir_context *);
static int nfs_fsync_dir(struct file *, loff_t, loff_t, int);
static loff_t nfs_llseek_dir(struct file *, loff_t, int);
static void nfs_readdir_clear_array(struct page*);
const struct file_operations nfs_dir_operations = {
.llseek = nfs_llseek_dir,
.read = generic_read_dir,
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
.iterate = nfs_readdir,
.open = nfs_opendir,
.release = nfs_closedir,
.fsync = nfs_fsync_dir,
};
const struct address_space_operations nfs_dir_aops = {
.freepage = nfs_readdir_clear_array,
};
static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir, const struct cred *cred)
{
struct nfs_inode *nfsi = NFS_I(dir);
struct nfs_open_dir_context *ctx;
ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
if (ctx != NULL) {
ctx->duped = 0;
ctx->attr_gencount = nfsi->attr_gencount;
ctx->dir_cookie = 0;
ctx->dup_cookie = 0;
ctx->cred = get_cred(cred);
spin_lock(&dir->i_lock);
list_add(&ctx->list, &nfsi->open_files);
spin_unlock(&dir->i_lock);
return ctx;
}
return ERR_PTR(-ENOMEM);
}
static void put_nfs_open_dir_context(struct inode *dir, struct nfs_open_dir_context *ctx)
{
spin_lock(&dir->i_lock);
list_del(&ctx->list);
spin_unlock(&dir->i_lock);
put_cred(ctx->cred);
kfree(ctx);
}
/*
* Open file
*/
static int
nfs_opendir(struct inode *inode, struct file *filp)
{
int res = 0;
struct nfs_open_dir_context *ctx;
dfprintk(FILE, "NFS: open dir(%pD2)\n", filp);
nfs_inc_stats(inode, NFSIOS_VFSOPEN);
ctx = alloc_nfs_open_dir_context(inode, current_cred());
if (IS_ERR(ctx)) {
res = PTR_ERR(ctx);
goto out;
}
filp->private_data = ctx;
out:
return res;
}
static int
nfs_closedir(struct inode *inode, struct file *filp)
{
put_nfs_open_dir_context(file_inode(filp), filp->private_data);
return 0;
}
struct nfs_cache_array_entry {
u64 cookie;
u64 ino;
struct qstr string;
unsigned char d_type;
};
struct nfs_cache_array {
int size;
int eof_index;
u64 last_cookie;
struct nfs_cache_array_entry array[0];
};
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
struct readdirvec {
unsigned long nr;
unsigned long index;
struct page *pages[NFS_MAX_READDIR_RAPAGES];
};
typedef int (*decode_dirent_t)(struct xdr_stream *, struct nfs_entry *, bool);
typedef struct {
struct file *file;
struct page *page;
struct dir_context *ctx;
unsigned long page_index;
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
struct readdirvec pvec;
u64 *dir_cookie;
u64 last_cookie;
loff_t current_index;
decode_dirent_t decode;
unsigned long timestamp;
unsigned long gencount;
unsigned int cache_entry_index;
bool plus;
bool eof;
} nfs_readdir_descriptor_t;
/*
* we are freeing strings created by nfs_add_to_readdir_array()
*/
static
void nfs_readdir_clear_array(struct page *page)
{
struct nfs_cache_array *array;
int i;
array = kmap_atomic(page);
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
for (i = 0; i < array->size; i++)
kfree(array->array[i].string.name);
kunmap_atomic(array);
}
/*
* the caller is responsible for freeing qstr.name
* when called by nfs_readdir_add_to_array, the strings will be freed in
* nfs_clear_readdir_array()
*/
static
int nfs_readdir_make_qstr(struct qstr *string, const char *name, unsigned int len)
{
string->len = len;
string->name = kmemdup(name, len, GFP_KERNEL);
if (string->name == NULL)
return -ENOMEM;
/*
* Avoid a kmemleak false positive. The pointer to the name is stored
* in a page cache page which kmemleak does not scan.
*/
kmemleak_not_leak(string->name);
string->hash = full_name_hash(NULL, name, len);
return 0;
}
static
int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
{
struct nfs_cache_array *array = kmap(page);
struct nfs_cache_array_entry *cache_entry;
int ret;
cache_entry = &array->array[array->size];
/* Check that this entry lies within the page bounds */
ret = -ENOSPC;
if ((char *)&cache_entry[1] - (char *)page_address(page) > PAGE_SIZE)
goto out;
cache_entry->cookie = entry->prev_cookie;
cache_entry->ino = entry->ino;
cache_entry->d_type = entry->d_type;
ret = nfs_readdir_make_qstr(&cache_entry->string, entry->name, entry->len);
if (ret)
goto out;
array->last_cookie = entry->cookie;
array->size++;
if (entry->eof != 0)
array->eof_index = array->size;
out:
kunmap(page);
return ret;
}
static
int nfs_readdir_search_for_pos(struct nfs_cache_array *array, nfs_readdir_descriptor_t *desc)
{
loff_t diff = desc->ctx->pos - desc->current_index;
unsigned int index;
if (diff < 0)
goto out_eof;
if (diff >= array->size) {
if (array->eof_index >= 0)
goto out_eof;
return -EAGAIN;
}
index = (unsigned int)diff;
*desc->dir_cookie = array->array[index].cookie;
desc->cache_entry_index = index;
return 0;
out_eof:
desc->eof = true;
return -EBADCOOKIE;
}
static bool
nfs_readdir_inode_mapping_valid(struct nfs_inode *nfsi)
{
if (nfsi->cache_validity & (NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA))
return false;
smp_rmb();
return !test_bit(NFS_INO_INVALIDATING, &nfsi->flags);
}
static
int nfs_readdir_search_for_cookie(struct nfs_cache_array *array, nfs_readdir_descriptor_t *desc)
{
int i;
loff_t new_pos;
int status = -EAGAIN;
for (i = 0; i < array->size; i++) {
if (array->array[i].cookie == *desc->dir_cookie) {
struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
struct nfs_open_dir_context *ctx = desc->file->private_data;
new_pos = desc->current_index + i;
if (ctx->attr_gencount != nfsi->attr_gencount ||
!nfs_readdir_inode_mapping_valid(nfsi)) {
ctx->duped = 0;
ctx->attr_gencount = nfsi->attr_gencount;
} else if (new_pos < desc->ctx->pos) {
if (ctx->duped > 0
&& ctx->dup_cookie == *desc->dir_cookie) {
if (printk_ratelimit()) {
pr_notice("NFS: directory %pD2 contains a readdir loop."
"Please contact your server vendor. "
"The file: %.*s has duplicate cookie %llu\n",
desc->file, array->array[i].string.len,
array->array[i].string.name, *desc->dir_cookie);
}
status = -ELOOP;
goto out;
}
ctx->dup_cookie = *desc->dir_cookie;
ctx->duped = -1;
}
desc->ctx->pos = new_pos;
desc->cache_entry_index = i;
return 0;
}
}
if (array->eof_index >= 0) {
status = -EBADCOOKIE;
if (*desc->dir_cookie == array->last_cookie)
desc->eof = true;
}
out:
return status;
}
static
int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc)
{
struct nfs_cache_array *array;
int status;
array = kmap(desc->page);
if (*desc->dir_cookie == 0)
status = nfs_readdir_search_for_pos(array, desc);
else
status = nfs_readdir_search_for_cookie(array, desc);
if (status == -EAGAIN) {
desc->last_cookie = array->last_cookie;
desc->current_index += array->size;
desc->page_index++;
}
kunmap(desc->page);
return status;
}
/* Fill a page with xdr information before transferring to the cache page */
static
int nfs_readdir_xdr_filler(struct page **pages, nfs_readdir_descriptor_t *desc,
struct nfs_entry *entry, struct file *file, struct inode *inode)
{
struct nfs_open_dir_context *ctx = file->private_data;
const struct cred *cred = ctx->cred;
unsigned long timestamp, gencount;
int error;
again:
timestamp = jiffies;
gencount = nfs_inc_attr_generation_counter();
error = NFS_PROTO(inode)->readdir(file_dentry(file), cred, entry->cookie, pages,
NFS_SERVER(inode)->dtsize, desc->plus);
if (error < 0) {
/* We requested READDIRPLUS, but the server doesn't grok it */
if (error == -ENOTSUPP && desc->plus) {
NFS_SERVER(inode)->caps &= ~NFS_CAP_READDIRPLUS;
clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
desc->plus = false;
goto again;
}
goto error;
}
desc->timestamp = timestamp;
desc->gencount = gencount;
error:
return error;
}
static int xdr_decode(nfs_readdir_descriptor_t *desc,
struct nfs_entry *entry, struct xdr_stream *xdr)
{
int error;
error = desc->decode(xdr, entry, desc->plus);
if (error)
return error;
entry->fattr->time_start = desc->timestamp;
entry->fattr->gencount = desc->gencount;
return 0;
}
/* Match file and dirent using either filehandle or fileid
* Note: caller is responsible for checking the fsid
*/
static
int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
{
struct inode *inode;
struct nfs_inode *nfsi;
if (d_really_is_negative(dentry))
return 0;
inode = d_inode(dentry);
if (is_bad_inode(inode) || NFS_STALE(inode))
return 0;
nfsi = NFS_I(inode);
if (entry->fattr->fileid != nfsi->fileid)
return 0;
if (entry->fh->size && nfs_compare_fh(entry->fh, &nfsi->fh) != 0)
return 0;
return 1;
}
static
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
return false;
if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
return true;
if (ctx->pos == 0)
return true;
return false;
}
/*
* This function is called by the lookup and getattr code to request the
* use of readdirplus to accelerate any future lookups in the same
* directory.
*/
void nfs_advise_use_readdirplus(struct inode *dir)
{
struct nfs_inode *nfsi = NFS_I(dir);
if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
!list_empty(&nfsi->open_files))
set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
}
/*
* This function is mainly for use by nfs_getattr().
*
* If this is an 'ls -l', we want to force use of readdirplus.
* Do this by checking if there is an active file descriptor
* and calling nfs_advise_use_readdirplus, then forcing a
* cache flush.
*/
void nfs_force_use_readdirplus(struct inode *dir)
{
struct nfs_inode *nfsi = NFS_I(dir);
if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
!list_empty(&nfsi->open_files)) {
set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
invalidate_mapping_pages(dir->i_mapping, 0, -1);
}
}
static
void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
{
struct qstr filename = QSTR_INIT(entry->name, entry->len);
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct dentry *dentry;
struct dentry *alias;
struct inode *dir = d_inode(parent);
struct inode *inode;
int status;
if (!(entry->fattr->valid & NFS_ATTR_FATTR_FILEID))
return;
if (!(entry->fattr->valid & NFS_ATTR_FATTR_FSID))
return;
if (filename.len == 0)
return;
/* Validate that the name doesn't contain any illegal '\0' */
if (strnlen(filename.name, filename.len) != filename.len)
return;
/* ...or '/' */
if (strnchr(filename.name, filename.len, '/'))
return;
if (filename.name[0] == '.') {
if (filename.len == 1)
return;
if (filename.len == 2 && filename.name[1] == '.')
return;
}
filename.hash = full_name_hash(parent, filename.name, filename.len);
dentry = d_lookup(parent, &filename);
again:
if (!dentry) {
dentry = d_alloc_parallel(parent, &filename, &wq);
if (IS_ERR(dentry))
return;
}
if (!d_in_lookup(dentry)) {
/* Is there a mountpoint here? If so, just exit */
if (!nfs_fsid_equal(&NFS_SB(dentry->d_sb)->fsid,
&entry->fattr->fsid))
goto out;
if (nfs_same_file(dentry, entry)) {
if (!entry->fh->size)
goto out;
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
status = nfs_refresh_inode(d_inode(dentry), entry->fattr);
if (!status)
nfs_setsecurity(d_inode(dentry), entry->fattr, entry->label);
goto out;
} else {
d_invalidate(dentry);
dput(dentry);
dentry = NULL;
goto again;
}
}
if (!entry->fh->size) {
d_lookup_done(dentry);
goto out;
}
inode = nfs_fhget(dentry->d_sb, entry->fh, entry->fattr, entry->label);
alias = d_splice_alias(inode, dentry);
d_lookup_done(dentry);
if (alias) {
if (IS_ERR(alias))
goto out;
dput(dentry);
dentry = alias;
}
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
out:
dput(dentry);
}
/* Perform conversion from xdr to cache array */
static
int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry,
struct page **xdr_pages, struct page *page, unsigned int buflen)
{
struct xdr_stream stream;
struct xdr_buf buf;
struct page *scratch;
struct nfs_cache_array *array;
unsigned int count = 0;
int status;
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
int max_rapages = NFS_MAX_READDIR_RAPAGES;
desc->pvec.index = desc->page_index;
desc->pvec.nr = 0;
scratch = alloc_page(GFP_KERNEL);
if (scratch == NULL)
return -ENOMEM;
if (buflen == 0)
goto out_nopages;
xdr_init_decode_pages(&stream, &buf, xdr_pages, buflen);
xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
do {
status = xdr_decode(desc, entry, &stream);
if (status != 0) {
if (status == -EAGAIN)
status = 0;
break;
}
count++;
if (desc->plus)
nfs_prime_dcache(file_dentry(desc->file), entry);
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
status = nfs_readdir_add_to_array(entry, desc->pvec.pages[desc->pvec.nr]);
if (status == -ENOSPC) {
desc->pvec.nr++;
if (desc->pvec.nr == max_rapages)
break;
status = nfs_readdir_add_to_array(entry, desc->pvec.pages[desc->pvec.nr]);
}
if (status != 0)
break;
} while (!entry->eof);
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
/*
* page and desc->pvec.pages[0] are valid, don't need to check
* whether or not to be NULL.
*/
copy_highpage(page, desc->pvec.pages[0]);
out_nopages:
if (count == 0 || (status == -EBADCOOKIE && entry->eof != 0)) {
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
array = kmap_atomic(desc->pvec.pages[desc->pvec.nr]);
array->eof_index = array->size;
status = 0;
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
kunmap_atomic(array);
}
put_page(scratch);
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
/*
* desc->pvec.nr > 0 means at least one page was completely filled,
* we should return -ENOSPC. Otherwise function
* nfs_readdir_xdr_to_array will enter infinite loop.
*/
if (desc->pvec.nr > 0)
return -ENOSPC;
return status;
}
static
void nfs_readdir_free_pages(struct page **pages, unsigned int npages)
{
unsigned int i;
for (i = 0; i < npages; i++)
put_page(pages[i]);
}
/*
* nfs_readdir_alloc_pages() will allocate pages that must be freed with a call
* to nfs_readdir_free_pages()
*/
static
int nfs_readdir_alloc_pages(struct page **pages, unsigned int npages)
{
unsigned int i;
for (i = 0; i < npages; i++) {
struct page *page = alloc_page(GFP_KERNEL);
if (page == NULL)
goto out_freepages;
pages[i] = page;
}
return 0;
out_freepages:
nfs_readdir_free_pages(pages, i);
return -ENOMEM;
}
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
/*
* nfs_readdir_rapages_init initialize rapages by nfs_cache_array structure.
*/
static
void nfs_readdir_rapages_init(nfs_readdir_descriptor_t *desc)
{
struct nfs_cache_array *array;
int max_rapages = NFS_MAX_READDIR_RAPAGES;
int index;
for (index = 0; index < max_rapages; index++) {
array = kmap_atomic(desc->pvec.pages[index]);
memset(array, 0, sizeof(struct nfs_cache_array));
array->eof_index = -1;
kunmap_atomic(array);
}
}
static
int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, struct inode *inode)
{
struct page *pages[NFS_MAX_READDIR_PAGES];
struct nfs_entry entry;
struct file *file = desc->file;
struct nfs_cache_array *array;
int status = -ENOMEM;
unsigned int array_size = ARRAY_SIZE(pages);
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
/*
* This means we hit readdir rdpages miss, the preallocated rdpages
* are useless, the preallocate rdpages should be reinitialized.
*/
nfs_readdir_rapages_init(desc);
entry.prev_cookie = 0;
entry.cookie = desc->last_cookie;
entry.eof = 0;
entry.fh = nfs_alloc_fhandle();
entry.fattr = nfs_alloc_fattr();
entry.server = NFS_SERVER(inode);
if (entry.fh == NULL || entry.fattr == NULL)
goto out;
entry.label = nfs4_label_alloc(NFS_SERVER(inode), GFP_NOWAIT);
if (IS_ERR(entry.label)) {
status = PTR_ERR(entry.label);
goto out;
}
array = kmap(page);
memset(array, 0, sizeof(struct nfs_cache_array));
array->eof_index = -1;
status = nfs_readdir_alloc_pages(pages, array_size);
if (status < 0)
goto out_release_array;
do {
unsigned int pglen;
status = nfs_readdir_xdr_filler(pages, desc, &entry, file, inode);
if (status < 0)
break;
pglen = status;
status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen);
if (status < 0) {
if (status == -ENOSPC)
status = 0;
break;
}
} while (array->eof_index < 0);
nfs_readdir_free_pages(pages, array_size);
out_release_array:
kunmap(page);
nfs4_label_free(entry.label);
out:
nfs_free_fattr(entry.fattr);
nfs_free_fhandle(entry.fh);
return status;
}
/*
* Now we cache directories properly, by converting xdr information
* to an array that can be used for lookups later. This results in
* fewer cache pages, since we can store more information on each page.
* We only need to convert from xdr once so future lookups are much simpler
*/
static
int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
{
struct inode *inode = file_inode(desc->file);
int ret;
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
/*
* If desc->page_index in range desc->pvec.index and
* desc->pvec.index + desc->pvec.nr, we get readdir cache hit.
*/
if (desc->page_index >= desc->pvec.index &&
desc->page_index < (desc->pvec.index + desc->pvec.nr)) {
/*
* page and desc->pvec.pages[x] are valid, don't need to check
* whether or not to be NULL.
*/
copy_highpage(page, desc->pvec.pages[desc->page_index - desc->pvec.index]);
ret = 0;
} else {
ret = nfs_readdir_xdr_to_array(desc, page, inode);
if (ret < 0)
goto error;
}
SetPageUptodate(page);
if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
/* Should never happen */
nfs_zap_mapping(inode, inode->i_mapping);
}
unlock_page(page);
return 0;
error:
unlock_page(page);
return ret;
}
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
if (!desc->page->mapping)
nfs_readdir_clear_array(desc->page);
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 19:29:47 +07:00
put_page(desc->page);
desc->page = NULL;
}
static
struct page *get_cache_page(nfs_readdir_descriptor_t *desc)
{
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
return read_cache_page(desc->file->f_mapping,
desc->page_index, (filler_t *)nfs_readdir_filler, desc);
}
/*
* Returns 0 if desc->dir_cookie was found on page desc->page_index
*/
static
int find_cache_page(nfs_readdir_descriptor_t *desc)
{
int res;
desc->page = get_cache_page(desc);
if (IS_ERR(desc->page))
return PTR_ERR(desc->page);
res = nfs_readdir_search_array(desc);
if (res != 0)
cache_page_release(desc);
return res;
}
/* Search for desc->dir_cookie from the beginning of the page cache */
static inline
int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
{
int res;
if (desc->page_index == 0) {
desc->current_index = 0;
desc->last_cookie = 0;
}
do {
res = find_cache_page(desc);
} while (res == -EAGAIN);
return res;
}
/*
* Once we've found the start of the dirent within a page: fill 'er up...
*/
static
int nfs_do_filldir(nfs_readdir_descriptor_t *desc)
{
struct file *file = desc->file;
int i = 0;
int res = 0;
struct nfs_cache_array *array = NULL;
struct nfs_open_dir_context *ctx = file->private_data;
array = kmap(desc->page);
for (i = desc->cache_entry_index; i < array->size; i++) {
struct nfs_cache_array_entry *ent;
ent = &array->array[i];
if (!dir_emit(desc->ctx, ent->string.name, ent->string.len,
nfs_compat_user_ino64(ent->ino), ent->d_type)) {
desc->eof = true;
break;
}
desc->ctx->pos++;
if (i < (array->size-1))
*desc->dir_cookie = array->array[i+1].cookie;
else
*desc->dir_cookie = array->last_cookie;
if (ctx->duped != 0)
ctx->duped = 1;
}
if (array->eof_index >= 0)
desc->eof = true;
kunmap(desc->page);
cache_page_release(desc);
dfprintk(DIRCACHE, "NFS: nfs_do_filldir() filling ended @ cookie %Lu; returning = %d\n",
(unsigned long long)*desc->dir_cookie, res);
return res;
}
/*
* If we cannot find a cookie in our cache, we suspect that this is
* because it points to a deleted file, so we ask the server to return
* whatever it thinks is the next entry. We then feed this to filldir.
* If all goes well, we should then be able to find our way round the
* cache on the next call to readdir_search_pagecache();
*
* NOTE: we cannot add the anonymous page to the pagecache because
* the data it contains might not be page aligned. Besides,
* we should already have a complete representation of the
* directory in the page cache by the time we get here.
*/
static inline
int uncached_readdir(nfs_readdir_descriptor_t *desc)
{
struct page *page = NULL;
int status;
struct inode *inode = file_inode(desc->file);
struct nfs_open_dir_context *ctx = desc->file->private_data;
dfprintk(DIRCACHE, "NFS: uncached_readdir() searching for cookie %Lu\n",
(unsigned long long)*desc->dir_cookie);
page = alloc_page(GFP_HIGHUSER);
if (!page) {
status = -ENOMEM;
goto out;
}
desc->page_index = 0;
desc->last_cookie = *desc->dir_cookie;
desc->page = page;
ctx->duped = 0;
status = nfs_readdir_xdr_to_array(desc, page, inode);
if (status < 0)
goto out_release;
status = nfs_do_filldir(desc);
out:
dfprintk(DIRCACHE, "NFS: %s: returns %d\n",
__func__, status);
return status;
out_release:
cache_page_release(desc);
goto out;
}
/* The file offset position represents the dirent entry number. A
last cookie cache takes care of the common case of reading the
whole directory.
*/
static int nfs_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file_dentry(file);
struct inode *inode = d_inode(dentry);
nfs_readdir_descriptor_t my_desc,
*desc = &my_desc;
struct nfs_open_dir_context *dir_ctx = file->private_data;
int res = 0;
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
int max_rapages = NFS_MAX_READDIR_RAPAGES;
dfprintk(FILE, "NFS: readdir(%pD2) starting at cookie %llu\n",
file, (long long)ctx->pos);
nfs_inc_stats(inode, NFSIOS_VFSGETDENTS);
/*
* ctx->pos points to the dirent entry number.
* *desc->dir_cookie has the cookie for the next entry. We have
* to either find the entry with the appropriate number or
* revalidate the cookie.
*/
memset(desc, 0, sizeof(*desc));
desc->file = file;
desc->ctx = ctx;
desc->dir_cookie = &dir_ctx->dir_cookie;
desc->decode = NFS_PROTO(inode)->decode_dirent;
desc->plus = nfs_use_readdirplus(inode, ctx);
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
res = nfs_readdir_alloc_pages(desc->pvec.pages, max_rapages);
if (res < 0)
return -ENOMEM;
nfs_readdir_rapages_init(desc);
if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
res = nfs_revalidate_mapping(inode, file->f_mapping);
if (res < 0)
goto out;
do {
res = readdir_search_pagecache(desc);
if (res == -EBADCOOKIE) {
res = 0;
/* This means either end of directory */
if (*desc->dir_cookie && !desc->eof) {
/* Or that the server has 'lost' a cookie */
res = uncached_readdir(desc);
if (res == 0)
continue;
}
break;
}
if (res == -ETOOSMALL && desc->plus) {
clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
nfs_zap_caches(inode);
desc->page_index = 0;
desc->plus = false;
desc->eof = false;
continue;
}
if (res < 0)
break;
res = nfs_do_filldir(desc);
if (res < 0)
break;
} while (!desc->eof);
out:
NFS: readdirplus optimization by cache mechanism When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-01-29 14:34:17 +07:00
nfs_readdir_free_pages(desc->pvec.pages, max_rapages);
if (res > 0)
res = 0;
dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res);
return res;
}
static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
{
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
struct inode *inode = file_inode(filp);
struct nfs_open_dir_context *dir_ctx = filp->private_data;
dfprintk(FILE, "NFS: llseek dir(%pD2, %lld, %d)\n",
filp, offset, whence);
switch (whence) {
default:
return -EINVAL;
case SEEK_SET:
if (offset < 0)
return -EINVAL;
inode_lock(inode);
break;
case SEEK_CUR:
if (offset == 0)
return filp->f_pos;
inode_lock(inode);
offset += filp->f_pos;
if (offset < 0) {
inode_unlock(inode);
return -EINVAL;
}
}
if (offset != filp->f_pos) {
filp->f_pos = offset;
dir_ctx->dir_cookie = 0;
dir_ctx->duped = 0;
}
NFS: switch back to to ->iterate() NFS has some optimizations for readdir to choose between using READDIR or READDIRPLUS based on workload, and which NFS operation to use is determined by subsequent interactions with lookup, d_revalidate, and getattr. Concurrent use of nfs_readdir() via ->iterate_shared() can cause those optimizations to repeatedly invalidate the pagecache used to store directory entries during readdir(), which causes some very bad performance for directories with many entries (more than about 10000). There's a couple ways to fix this in NFS, but no fix would be as simple as going back to ->iterate() to serialize nfs_readdir(), and neither fix I tested performed as well as going back to ->iterate(). The first required taking the directory's i_lock for each entry, with the result of terrible contention. The second way adds another flag to the nfs_inode, and so keeps the optimizations working for large directories. The difference from using ->iterate() here is that much more memory is consumed for a given workload without any performance gain. The workings of nfs_readdir() are such that concurrent users are serialized within read_cache_page() waiting to retrieve pages of entries from the server. By serializing this work in iterate_dir() instead, contention for cache pages is reduced. Waiting processes can have an uncontended pass at the entirety of the directory's pagecache once previous processes have completed filling it. v2 - Keep the bits needed for parallel lookup Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-03-11 05:07:46 +07:00
inode_unlock(inode);
return offset;
}
/*
* All directory operations under NFS are synchronous, so fsync()
* is a dummy operation.
*/
static int nfs_fsync_dir(struct file *filp, loff_t start, loff_t end,
int datasync)
{
struct inode *inode = file_inode(filp);
dfprintk(FILE, "NFS: fsync dir(%pD2) datasync %d\n", filp, datasync);
inode_lock(inode);
nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
inode_unlock(inode);
return 0;
}
/**
* nfs_force_lookup_revalidate - Mark the directory as having changed
* @dir: pointer to directory inode
*
* This forces the revalidation code in nfs_lookup_revalidate() to do a
* full lookup on all child dentries of 'dir' whenever a change occurs
* on the server that might have invalidated our dcache.
*
* The caller should be holding dir->i_lock
*/
void nfs_force_lookup_revalidate(struct inode *dir)
{
NFS_I(dir)->cache_change_attribute++;
}
EXPORT_SYMBOL_GPL(nfs_force_lookup_revalidate);
/*
* A check for whether or not the parent directory has changed.
* In the case it has, we assume that the dentries are untrustworthy
* and may need to be looked up again.
* If rcu_walk prevents us from performing a full check, return 0.
*/
static int nfs_check_verifier(struct inode *dir, struct dentry *dentry,
int rcu_walk)
{
if (IS_ROOT(dentry))
return 1;
if (NFS_SERVER(dir)->flags & NFS_MOUNT_LOOKUP_CACHE_NONE)
return 0;
if (!nfs_verify_change_attribute(dir, dentry->d_time))
return 0;
/* Revalidate nfsi->cache_change_attribute before we declare a match */
if (nfs_mapping_need_revalidate_inode(dir)) {
if (rcu_walk)
return 0;
if (__nfs_revalidate_inode(NFS_SERVER(dir), dir) < 0)
return 0;
}
if (!nfs_verify_change_attribute(dir, dentry->d_time))
return 0;
return 1;
}
/*
* Use intent information to check whether or not we're going to do
* an O_EXCL create using this path component.
*/
static int nfs_is_exclusive_create(struct inode *dir, unsigned int flags)
{
if (NFS_PROTO(dir)->version == 2)
return 0;
return flags & LOOKUP_EXCL;
}
/*
* Inode and filehandle revalidation for lookups.
*
* We force revalidation in the cases where the VFS sets LOOKUP_REVAL,
* or if the intent information indicates that we're about to open this
* particular file and the "nocto" mount flag is not set.
*
*/
static
int nfs_lookup_verify_inode(struct inode *inode, unsigned int flags)
{
struct nfs_server *server = NFS_SERVER(inode);
int ret;
if (IS_AUTOMOUNT(inode))
return 0;
if (flags & LOOKUP_OPEN) {
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
/* A NFSv4 OPEN will revalidate later */
if (server->caps & NFS_CAP_ATOMIC_OPEN)
goto out;
/* Fallthrough */
case S_IFDIR:
if (server->flags & NFS_MOUNT_NOCTO)
break;
/* NFS close-to-open cache consistency validation */
goto out_force;
}
}
/* VFS wants an on-the-wire revalidation */
if (flags & LOOKUP_REVAL)
goto out_force;
out:
return (inode->i_nlink == 0) ? -ESTALE : 0;
out_force:
if (flags & LOOKUP_RCU)
return -ECHILD;
ret = __nfs_revalidate_inode(server, inode);
if (ret != 0)
return ret;
goto out;
}
/*
* We judge how long we want to trust negative
* dentries by looking at the parent inode mtime.
*
* If parent mtime has changed, we revalidate, else we wait for a
* period corresponding to the parent's attribute cache timeout value.
*
* If LOOKUP_RCU prevents us from performing a full check, return 1
* suggesting a reval is needed.
*
* Note that when creating a new file, or looking up a rename target,
* then it shouldn't be necessary to revalidate a negative dentry.
*/
static inline
int nfs_neg_need_reval(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
return 0;
if (NFS_SERVER(dir)->flags & NFS_MOUNT_LOOKUP_CACHE_NONEG)
return 1;
return !nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU);
}
static int
nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
struct inode *inode, int error)
{
switch (error) {
case 1:
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is valid\n",
__func__, dentry);
return 1;
case 0:
nfs_mark_for_revalidate(dir);
if (inode && S_ISDIR(inode->i_mode)) {
/* Purge readdir caches. */
nfs_zap_caches(inode);
/*
* We can't d_drop the root of a disconnected tree:
* its d_hash is on the s_anon list and d_drop() would hide
* it from shrink_dcache_for_unmount(), leading to busy
* inodes on unmount and further oopses.
*/
if (IS_ROOT(dentry))
return 1;
}
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
__func__, dentry);
return 0;
}
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) lookup returned error %d\n",
__func__, dentry, error);
return error;
}
static int
nfs_lookup_revalidate_negative(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
int ret = 1;
if (nfs_neg_need_reval(dir, dentry, flags)) {
if (flags & LOOKUP_RCU)
return -ECHILD;
ret = 0;
}
return nfs_lookup_revalidate_done(dir, dentry, NULL, ret);
}
static int
nfs_lookup_revalidate_delegated(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
}
static int
nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
struct nfs_fh *fhandle;
struct nfs_fattr *fattr;
struct nfs4_label *label;
int ret;
ret = -ENOMEM;
fhandle = nfs_alloc_fhandle();
fattr = nfs_alloc_fattr();
label = nfs4_label_alloc(NFS_SERVER(inode), GFP_KERNEL);
if (fhandle == NULL || fattr == NULL || IS_ERR(label))
goto out;
ret = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
if (ret < 0) {
if (ret == -ESTALE || ret == -ENOENT)
ret = 0;
goto out;
}
ret = 0;
if (nfs_compare_fh(NFS_FH(inode), fhandle))
goto out;
if (nfs_refresh_inode(inode, fattr) < 0)
goto out;
nfs_setsecurity(inode, fattr, label);
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
/* set a readdirplus hint that we had a cache miss */
nfs_force_use_readdirplus(dir);
ret = 1;
out:
nfs_free_fattr(fattr);
nfs_free_fhandle(fhandle);
nfs4_label_free(label);
return nfs_lookup_revalidate_done(dir, dentry, inode, ret);
}
/*
* This is called every time the dcache has a lookup hit,
* and we should check whether we can really trust that
* lookup.
*
* NOTE! The hit can be a negative hit too, don't assume
* we have an inode!
*
* If the parent directory is seen to have changed, we throw out the
* cached dentry and do a new lookup.
*/
static int
nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
struct inode *inode;
int error;
nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
inode = d_inode(dentry);
if (!inode)
return nfs_lookup_revalidate_negative(dir, dentry, flags);
if (is_bad_inode(inode)) {
dfprintk(LOOKUPCACHE, "%s: %pd2 has dud inode\n",
__func__, dentry);
goto out_bad;
}
if (NFS_PROTO(dir)->have_delegation(inode, FMODE_READ))
return nfs_lookup_revalidate_delegated(dir, dentry, inode);
/* Force a full look up iff the parent directory has changed */
if (!(flags & (LOOKUP_EXCL | LOOKUP_REVAL)) &&
nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU)) {
NFS: only invalidate dentrys that are clearly invalid. Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate") in v3.18, a return of '0' from ->d_revalidate() will cause the dentry to be invalidated even if it has filesystems mounted on or it or on a descendant. The mounted filesystem is unmounted. This means we need to be careful not to return 0 unless the directory referred to truly is invalid. So -ESTALE or -ENOENT should invalidate the directory. Other errors such a -EPERM or -ERESTARTSYS should be returned from ->d_revalidate() so they are propagated to the caller. A particular problem can be demonstrated by: 1/ mount an NFS filesystem using NFSv3 on /mnt 2/ mount any other filesystem on /mnt/foo 3/ ls /mnt/foo 4/ turn off network, or otherwise make the server unable to respond 5/ ls /mnt/foo & 6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack 7/ kill -9 $! # this results in -ERESTARTSYS being returned 8/ observe that /mnt/foo has been unmounted. This patch changes nfs_lookup_revalidate() to only treat -ESTALE from nfs_lookup_verify_inode() and -ESTALE or -ENOENT from ->lookup() as indicating an invalid inode. Other errors are returned. Also nfs_check_inode_attributes() is changed to return -ESTALE rather than -EIO. This is consistent with the error returned in similar circumstances from nfs_update_inode(). As this bug allows any user to unmount a filesystem mounted on an NFS filesystem, this fix is suitable for stable kernels. Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Cc: stable@vger.kernel.org (v3.18+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-05 09:22:20 +07:00
error = nfs_lookup_verify_inode(inode, flags);
if (error) {
if (error == -ESTALE)
nfs_zap_caches(dir);
goto out_bad;
}
nfs_advise_use_readdirplus(dir);
goto out_valid;
}
if (flags & LOOKUP_RCU)
return -ECHILD;
if (NFS_STALE(inode))
goto out_bad;
trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
error = nfs_lookup_revalidate_dentry(dir, dentry, inode);
trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
return error;
out_valid:
return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
out_bad:
if (flags & LOOKUP_RCU)
return -ECHILD;
return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
}
static int
__nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
int (*reval)(struct inode *, struct dentry *, unsigned int))
{
struct dentry *parent;
struct inode *dir;
int ret;
if (flags & LOOKUP_RCU) {
parent = READ_ONCE(dentry->d_parent);
dir = d_inode_rcu(parent);
if (!dir)
return -ECHILD;
ret = reval(dir, dentry, flags);
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 04:07:29 +07:00
if (parent != READ_ONCE(dentry->d_parent))
return -ECHILD;
} else {
parent = dget_parent(dentry);
ret = reval(d_inode(parent), dentry, flags);
dput(parent);
}
return ret;
}
static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
{
return __nfs_lookup_revalidate(dentry, flags, nfs_do_lookup_revalidate);
}
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
/*
* A weaker form of d_revalidate for revalidating just the d_inode(dentry)
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
* when we don't really care about the dentry name. This is called when a
* pathwalk ends on a dentry that was not found via a normal lookup in the
* parent dir (e.g.: ".", "..", procfs symlinks or mountpoint traversals).
*
* In this situation, we just want to verify that the inode itself is OK
* since the dentry might have changed on the server.
*/
static int nfs_weak_revalidate(struct dentry *dentry, unsigned int flags)
{
struct inode *inode = d_inode(dentry);
int error = 0;
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
/*
* I believe we can only get a negative dentry here in the case of a
* procfs-style symlink. Just assume it's correct for now, but we may
* eventually need to do something more here.
*/
if (!inode) {
dfprintk(LOOKUPCACHE, "%s: %pd2 has negative inode\n",
__func__, dentry);
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
return 1;
}
if (is_bad_inode(inode)) {
dfprintk(LOOKUPCACHE, "%s: %pd2 has dud inode\n",
__func__, dentry);
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
return 0;
}
error = nfs_lookup_verify_inode(inode, flags);
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
dfprintk(LOOKUPCACHE, "NFS: %s: inode %lu is %s\n",
__func__, inode->i_ino, error ? "invalid" : "valid");
return !error;
}
/*
* This is called from dput() when d_count is going to 0.
*/
static int nfs_dentry_delete(const struct dentry *dentry)
{
dfprintk(VFS, "NFS: dentry_delete(%pd2, %x)\n",
dentry, dentry->d_flags);
/* Unhash any dentry with a stale inode */
if (d_really_is_positive(dentry) && NFS_STALE(d_inode(dentry)))
return 1;
if (dentry->d_flags & DCACHE_NFSFS_RENAMED) {
/* Unhash it, so that ->d_iput() would be called */
return 1;
}
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-28 04:05:09 +07:00
if (!(dentry->d_sb->s_flags & SB_ACTIVE)) {
/* Unhash it, so that ancestors of killed async unlink
* files will be cleaned up during umount */
return 1;
}
return 0;
}
NFS: Fix calls to drop_nlink() It is almost always wrong for NFS to call drop_nlink() after removing a file. What we really want is to mark the inode's attributes for revalidation, and we want to ensure that the VFS drops it if we're reasonably sure that this is the final unlink(). Do the former using the usual cache validity flags, and the latter by testing if inode->i_nlink == 1, and clearing it in that case. This also fixes the following warning reported by Neil Brown and Jeff Layton (among others). [634155.004438] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.5.0/lin [634155.004442] Hardware name: Latitude E6510 [634155.004577] crc_itu_t crc32c_intel snd_hwdep snd_pcm snd_timer snd soundcor [634155.004609] Pid: 13402, comm: bash Tainted: G W 3.5.0-36-desktop # [634155.004611] Call Trace: [634155.004630] [<ffffffff8100444a>] dump_trace+0xaa/0x2b0 [634155.004641] [<ffffffff815a23dc>] dump_stack+0x69/0x6f [634155.004653] [<ffffffff81041a0b>] warn_slowpath_common+0x7b/0xc0 [634155.004662] [<ffffffff811832e4>] drop_nlink+0x34/0x40 [634155.004687] [<ffffffffa05bb6c3>] nfs_dentry_iput+0x33/0x70 [nfs] [634155.004714] [<ffffffff8118049e>] dput+0x12e/0x230 [634155.004726] [<ffffffff8116b230>] __fput+0x170/0x230 [634155.004735] [<ffffffff81167c0f>] filp_close+0x5f/0x90 [634155.004743] [<ffffffff81167cd7>] sys_close+0x97/0x100 [634155.004754] [<ffffffff815c3b39>] system_call_fastpath+0x16/0x1b [634155.004767] [<00007f2a73a0d110>] 0x7f2a73a0d10f Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [3.3+]
2012-12-15 04:38:46 +07:00
/* Ensure that we revalidate inode->i_nlink */
static void nfs_drop_nlink(struct inode *inode)
{
spin_lock(&inode->i_lock);
NFS: Fix calls to drop_nlink() It is almost always wrong for NFS to call drop_nlink() after removing a file. What we really want is to mark the inode's attributes for revalidation, and we want to ensure that the VFS drops it if we're reasonably sure that this is the final unlink(). Do the former using the usual cache validity flags, and the latter by testing if inode->i_nlink == 1, and clearing it in that case. This also fixes the following warning reported by Neil Brown and Jeff Layton (among others). [634155.004438] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.5.0/lin [634155.004442] Hardware name: Latitude E6510 [634155.004577] crc_itu_t crc32c_intel snd_hwdep snd_pcm snd_timer snd soundcor [634155.004609] Pid: 13402, comm: bash Tainted: G W 3.5.0-36-desktop # [634155.004611] Call Trace: [634155.004630] [<ffffffff8100444a>] dump_trace+0xaa/0x2b0 [634155.004641] [<ffffffff815a23dc>] dump_stack+0x69/0x6f [634155.004653] [<ffffffff81041a0b>] warn_slowpath_common+0x7b/0xc0 [634155.004662] [<ffffffff811832e4>] drop_nlink+0x34/0x40 [634155.004687] [<ffffffffa05bb6c3>] nfs_dentry_iput+0x33/0x70 [nfs] [634155.004714] [<ffffffff8118049e>] dput+0x12e/0x230 [634155.004726] [<ffffffff8116b230>] __fput+0x170/0x230 [634155.004735] [<ffffffff81167c0f>] filp_close+0x5f/0x90 [634155.004743] [<ffffffff81167cd7>] sys_close+0x97/0x100 [634155.004754] [<ffffffff815c3b39>] system_call_fastpath+0x16/0x1b [634155.004767] [<00007f2a73a0d110>] 0x7f2a73a0d10f Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [3.3+]
2012-12-15 04:38:46 +07:00
/* drop the inode if we're reasonably sure this is the last link */
if (inode->i_nlink > 0)
drop_nlink(inode);
NFS_I(inode)->attr_gencount = nfs_inc_attr_generation_counter();
NFS_I(inode)->cache_validity |= NFS_INO_INVALID_CHANGE
| NFS_INO_INVALID_CTIME
| NFS_INO_INVALID_OTHER
| NFS_INO_REVAL_FORCED;
spin_unlock(&inode->i_lock);
}
/*
* Called when the dentry loses inode.
* We use it to clean up silly-renamed files.
*/
static void nfs_dentry_iput(struct dentry *dentry, struct inode *inode)
{
NFS: Fix directory caching problem - with test case and patch. Try running this script in an NFS mounted directory (Client relatively recent - 2.6.18 has the problem as does 2.6.20). ------------------------------------------------------ #!/bin/bash # # This script will produce the following errormessage from tar: # # tar: newdir/innerdir/innerfile: file changed as we read it # create dirs rm -rf nfstest mkdir -p nfstest/dir/innerdir # create files (should not be empty) echo "Hello World!" >nfstest/dir/file echo "Hello World!" >nfstest/dir/innerdir/innerfile # problem only happens if we sleep before chmod sleep 1 # change file modes chmod -R a+r nfstest # rename dir mv nfstest/dir nfstest/newdir # tar it tar -cf nfstest/nfstest.tar -C nfstest newdir # restore old dir name mv nfstest/newdir nfstest/dir -------------------------------------------------------- What happens: The 'chmod -R' does a readdir_plus in each directory and the results get cached in the page cache. It then updates the ctime on each file by one second. When this happens, the post-op attributes are used to update the ctime stored on the client to match the value in the kernel. The 'mv' calls shrink_dcache_parent on the directory tree which flushes all the dentries (so a new lookup will be required) but doesn't flush the inodes or pagecache. The 'tar' does a readdir on each directory, but (in the case of 'innerdir' at least) satisfies it from the pagecache and uses the READDIRPLUS data to update all the inodes. In the case of 'innerdir/innerfile', the ctime is out of date. 'tar' then calls 'lstat' on innerdir/innerfile getting an old ctime. It then opens the file (triggering a GETATTR), reads the content, and then calls fstat to see if anything has changed. It finds that ctime has changed and so complains. The problem seems to be that the cache readdirplus info is kept around for too long. My patch below discards pagecache data for directories when dentry_iput is called on them. This effectively removes the symptom which convinces me that I correctly understand the problem. However I'm not convinced that is a proper solution, as there could easily be other races that trigger the same problem without being affected by this 'fix'. One possibility would be to require that readdirplus pagecache data be only used *once* to instantiate an inode. Somehow it should then be invalidated so that if the dentry subsequently disappears, it will cause a new request to the server to fill in the stat data. Another possibility is to compare the cache_change_attribute on the inode with something similar for the readdirplus info and reject the info from readdirplus if it is too old. I haven't tried to implement these and would value other opinions before I do. Thanks, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-02-26 08:48:25 +07:00
if (S_ISDIR(inode->i_mode))
/* drop any readdir cache as it could easily be old */
NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
if (dentry->d_flags & DCACHE_NFSFS_RENAMED) {
nfs_complete_unlink(dentry, inode);
NFS: Fix calls to drop_nlink() It is almost always wrong for NFS to call drop_nlink() after removing a file. What we really want is to mark the inode's attributes for revalidation, and we want to ensure that the VFS drops it if we're reasonably sure that this is the final unlink(). Do the former using the usual cache validity flags, and the latter by testing if inode->i_nlink == 1, and clearing it in that case. This also fixes the following warning reported by Neil Brown and Jeff Layton (among others). [634155.004438] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.5.0/lin [634155.004442] Hardware name: Latitude E6510 [634155.004577] crc_itu_t crc32c_intel snd_hwdep snd_pcm snd_timer snd soundcor [634155.004609] Pid: 13402, comm: bash Tainted: G W 3.5.0-36-desktop # [634155.004611] Call Trace: [634155.004630] [<ffffffff8100444a>] dump_trace+0xaa/0x2b0 [634155.004641] [<ffffffff815a23dc>] dump_stack+0x69/0x6f [634155.004653] [<ffffffff81041a0b>] warn_slowpath_common+0x7b/0xc0 [634155.004662] [<ffffffff811832e4>] drop_nlink+0x34/0x40 [634155.004687] [<ffffffffa05bb6c3>] nfs_dentry_iput+0x33/0x70 [nfs] [634155.004714] [<ffffffff8118049e>] dput+0x12e/0x230 [634155.004726] [<ffffffff8116b230>] __fput+0x170/0x230 [634155.004735] [<ffffffff81167c0f>] filp_close+0x5f/0x90 [634155.004743] [<ffffffff81167cd7>] sys_close+0x97/0x100 [634155.004754] [<ffffffff815c3b39>] system_call_fastpath+0x16/0x1b [634155.004767] [<00007f2a73a0d110>] 0x7f2a73a0d10f Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [3.3+]
2012-12-15 04:38:46 +07:00
nfs_drop_nlink(inode);
}
iput(inode);
}
static void nfs_d_release(struct dentry *dentry)
{
/* free cached devname value, if it survived that far */
if (unlikely(dentry->d_fsdata)) {
if (dentry->d_flags & DCACHE_NFSFS_RENAMED)
WARN_ON(1);
else
kfree(dentry->d_fsdata);
}
}
const struct dentry_operations nfs_dentry_operations = {
.d_revalidate = nfs_lookup_revalidate,
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-20 23:19:05 +07:00
.d_weak_revalidate = nfs_weak_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
.d_automount = nfs_d_automount,
.d_release = nfs_d_release,
};
EXPORT_SYMBOL_GPL(nfs_dentry_operations);
struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned int flags)
{
struct dentry *res;
struct inode *inode = NULL;
struct nfs_fh *fhandle = NULL;
struct nfs_fattr *fattr = NULL;
struct nfs4_label *label = NULL;
int error;
dfprintk(VFS, "NFS: lookup(%pd2)\n", dentry);
nfs_inc_stats(dir, NFSIOS_VFSLOOKUP);
if (unlikely(dentry->d_name.len > NFS_SERVER(dir)->namelen))
return ERR_PTR(-ENAMETOOLONG);
/*
* If we're doing an exclusive create, optimize away the lookup
* but don't hash the dentry.
*/
if (nfs_is_exclusive_create(dir, flags) || flags & LOOKUP_RENAME_TARGET)
return NULL;
res = ERR_PTR(-ENOMEM);
fhandle = nfs_alloc_fhandle();
fattr = nfs_alloc_fattr();
if (fhandle == NULL || fattr == NULL)
goto out;
label = nfs4_label_alloc(NFS_SERVER(dir), GFP_NOWAIT);
if (IS_ERR(label))
goto out;
trace_nfs_lookup_enter(dir, dentry, flags);
error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
if (error == -ENOENT)
goto no_entry;
if (error < 0) {
res = ERR_PTR(error);
goto out_label;
}
inode = nfs_fhget(dentry->d_sb, fhandle, fattr, label);
res = ERR_CAST(inode);
if (IS_ERR(res))
goto out_label;
NFS: Share NFS superblocks per-protocol per-server per-FSID The attached patch makes NFS share superblocks between mounts from the same server and FSID over the same protocol. It does this by creating each superblock with a false root and returning the real root dentry in the vfsmount presented by get_sb(). The root dentry set starts off as an anonymous dentry if we don't already have the dentry for its inode, otherwise it simply returns the dentry we already have. We may thus end up with several trees of dentries in the superblock, and if at some later point one of anonymous tree roots is discovered by normal filesystem activity to be located in another tree within the superblock, the anonymous root is named and materialises attached to the second tree at the appropriate point. Why do it this way? Why not pass an extra argument to the mount() syscall to indicate the subpath and then pathwalk from the server root to the desired directory? You can't guarantee this will work for two reasons: (1) The root and intervening nodes may not be accessible to the client. With NFS2 and NFS3, for instance, mountd is called on the server to get the filehandle for the tip of a path. mountd won't give us handles for anything we don't have permission to access, and so we can't set up NFS inodes for such nodes, and so can't easily set up dentries (we'd have to have ghost inodes or something). With this patch we don't actually create dentries until we get handles from the server that we can use to set up their inodes, and we don't actually bind them into the tree until we know for sure where they go. (2) Inaccessible symbolic links. If we're asked to mount two exports from the server, eg: mount warthog:/warthog/aaa/xxx /mmm mount warthog:/warthog/bbb/yyy /nnn We may not be able to access anything nearer the root than xxx and yyy, but we may find out later that /mmm/www/yyy, say, is actually the same directory as the one mounted on /nnn. What we might then find out, for example, is that /warthog/bbb was actually a symbolic link to /warthog/aaa/xxx/www, but we can't actually determine that by talking to the server until /warthog is made available by NFS. This would lead to having constructed an errneous dentry tree which we can't easily fix. We can end up with a dentry marked as a directory when it should actually be a symlink, or we could end up with an apparently hardlinked directory. With this patch we need not make assumptions about the type of a dentry for which we can't retrieve information, nor need we assume we know its place in the grand scheme of things until we actually see that place. This patch reduces the possibility of aliasing in the inode and page caches for inodes that may be accessed by more than one NFS export. It also reduces the number of superblocks required for NFS where there are many NFS exports being used from a server (home directory server + autofs for example). This in turn makes it simpler to do local caching of network filesystems, as it can then be guaranteed that there won't be links from multiple inodes in separate superblocks to the same cache file. Obviously, cache aliasing between different levels of NFS protocol could still be a problem, but at least that gives us another key to use when indexing the cache. This patch makes the following changes: (1) The server record construction/destruction has been abstracted out into its own set of functions to make things easier to get right. These have been moved into fs/nfs/client.c. All the code in fs/nfs/client.c has to do with the management of connections to servers, and doesn't touch superblocks in any way; the remaining code in fs/nfs/super.c has to do with VFS superblock management. (2) The sequence of events undertaken by NFS mount is now reordered: (a) A volume representation (struct nfs_server) is allocated. (b) A server representation (struct nfs_client) is acquired. This may be allocated or shared, and is keyed on server address, port and NFS version. (c) If allocated, the client representation is initialised. The state member variable of nfs_client is used to prevent a race during initialisation from two mounts. (d) For NFS4 a simple pathwalk is performed, walking from FH to FH to find the root filehandle for the mount (fs/nfs/getroot.c). For NFS2/3 we are given the root FH in advance. (e) The volume FSID is probed for on the root FH. (f) The volume representation is initialised from the FSINFO record retrieved on the root FH. (g) sget() is called to acquire a superblock. This may be allocated or shared, keyed on client pointer and FSID. (h) If allocated, the superblock is initialised. (i) If the superblock is shared, then the new nfs_server record is discarded. (j) The root dentry for this mount is looked up from the root FH. (k) The root dentry for this mount is assigned to the vfsmount. (3) nfs_readdir_lookup() creates dentries for each of the entries readdir() returns; this function now attaches disconnected trees from alternate roots that happen to be discovered attached to a directory being read (in the same way nfs_lookup() is made to do for lookup ops). The new d_materialise_unique() function is now used to do this, thus permitting the whole thing to be done under one set of locks, and thus avoiding any race between mount and lookup operations on the same directory. (4) The client management code uses a new debug facility: NFSDBG_CLIENT which is set by echoing 1024 to /proc/net/sunrpc/nfs_debug. (5) Clone mounts are now called xdev mounts. (6) Use the dentry passed to the statfs() op as the handle for retrieving fs statistics rather than the root dentry of the superblock (which is now a dummy). Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-08-23 07:06:13 +07:00
/* Notify readdir to use READDIRPLUS */
nfs_force_use_readdirplus(dir);
no_entry:
res = d_splice_alias(inode, dentry);
if (res != NULL) {
if (IS_ERR(res))
goto out_label;
dentry = res;
}
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
out_label:
trace_nfs_lookup_exit(dir, dentry, flags, error);
nfs4_label_free(label);
out:
nfs_free_fattr(fattr);
nfs_free_fhandle(fhandle);
return res;
}
EXPORT_SYMBOL_GPL(nfs_lookup);
#if IS_ENABLED(CONFIG_NFS_V4)
static int nfs4_lookup_revalidate(struct dentry *, unsigned int);
const struct dentry_operations nfs4_dentry_operations = {
.d_revalidate = nfs4_lookup_revalidate,
.d_weak_revalidate = nfs_weak_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
.d_automount = nfs_d_automount,
.d_release = nfs_d_release,
};
EXPORT_SYMBOL_GPL(nfs4_dentry_operations);
static fmode_t flags_to_mode(int flags)
{
fmode_t res = (__force fmode_t)flags & FMODE_EXEC;
if ((flags & O_ACCMODE) != O_WRONLY)
res |= FMODE_READ;
if ((flags & O_ACCMODE) != O_RDONLY)
res |= FMODE_WRITE;
return res;
}
2016-10-13 11:26:47 +07:00
static struct nfs_open_context *create_nfs_open_context(struct dentry *dentry, int open_flags, struct file *filp)
{
2016-10-13 11:26:47 +07:00
return alloc_nfs_open_context(dentry, flags_to_mode(open_flags), filp);
}
static int do_open(struct inode *inode, struct file *filp)
{
NFS: Use i_writecount to control whether to get an fscache cookie in nfs_open() Use i_writecount to control whether to get an fscache cookie in nfs_open() as NFS does not do write caching yet. I *think* this is the cause of a problem encountered by Mark Moseley whereby __fscache_uncache_page() gets a NULL pointer dereference because cookie->def is NULL: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160 PGD 0 Thread overran stack, or stack corrupted Oops: 0000 [#1] SMP Modules linked in: ... CPU: 7 PID: 18993 Comm: php Not tainted 3.11.1 #1 Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 1.3.5 08/21/2012 task: ffff8804203460c0 ti: ffff880420346640 RIP: 0010:[<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160 RSP: 0018:ffff8801053af878 EFLAGS: 00210286 RAX: 0000000000000000 RBX: ffff8800be2f8780 RCX: ffff88022ffae5e8 RDX: 0000000000004c66 RSI: ffffea00055ff440 RDI: ffff8800be2f8780 RBP: ffff8801053af898 R08: 0000000000000001 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00055ff440 R13: 0000000000001000 R14: ffff8800c50be538 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88042fc60000(0063) knlGS:00000000e439c700 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000001d8f000 CR4: 00000000000607f0 Stack: ... Call Trace: [<ffffffff81365a72>] __nfs_fscache_invalidate_page+0x42/0x70 [<ffffffff813553d5>] nfs_invalidate_page+0x75/0x90 [<ffffffff811b8f5e>] truncate_inode_page+0x8e/0x90 [<ffffffff811b90ad>] truncate_inode_pages_range.part.12+0x14d/0x620 [<ffffffff81d6387d>] ? __mutex_lock_slowpath+0x1fd/0x2e0 [<ffffffff811b95d3>] truncate_inode_pages_range+0x53/0x70 [<ffffffff811b969d>] truncate_inode_pages+0x2d/0x40 [<ffffffff811b96ff>] truncate_pagecache+0x4f/0x70 [<ffffffff81356840>] nfs_setattr_update_inode+0xa0/0x120 [<ffffffff81368de4>] nfs3_proc_setattr+0xc4/0xe0 [<ffffffff81357f78>] nfs_setattr+0xc8/0x150 [<ffffffff8122d95b>] notify_change+0x1cb/0x390 [<ffffffff8120a55b>] do_truncate+0x7b/0xc0 [<ffffffff8121f96c>] do_last+0xa4c/0xfd0 [<ffffffff8121ffbc>] path_openat+0xcc/0x670 [<ffffffff81220a0e>] do_filp_open+0x4e/0xb0 [<ffffffff8120ba1f>] do_sys_open+0x13f/0x2b0 [<ffffffff8126aaf6>] compat_SyS_open+0x36/0x50 [<ffffffff81d7204c>] sysenter_dispatch+0x7/0x24 The code at the instruction pointer was disassembled: > (gdb) disas __fscache_uncache_page > Dump of assembler code for function __fscache_uncache_page: > ... > 0xffffffff812a18ff <+31>: mov 0x48(%rbx),%rax > 0xffffffff812a1903 <+35>: cmpb $0x0,0x10(%rax) > 0xffffffff812a1907 <+39>: je 0xffffffff812a19cd <__fscache_uncache_page+237> These instructions make up: ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX); That cmpb is the faulting instruction (%rax is 0). So cookie->def is NULL - which presumably means that the cookie has already been at least partway through __fscache_relinquish_cookie(). What I think may be happening is something like a three-way race on the same file: PROCESS 1 PROCESS 2 PROCESS 3 =============== =============== =============== open(O_TRUNC|O_WRONLY) open(O_RDONLY) open(O_WRONLY) -->nfs_open() -->nfs_fscache_set_inode_cookie() nfs_fscache_inode_lock() nfs_fscache_disable_inode_cookie() __fscache_relinquish_cookie() nfs_inode->fscache = NULL <--nfs_fscache_set_inode_cookie() -->nfs_open() -->nfs_fscache_set_inode_cookie() nfs_fscache_inode_lock() nfs_fscache_enable_inode_cookie() __fscache_acquire_cookie() nfs_inode->fscache = cookie <--nfs_fscache_set_inode_cookie() <--nfs_open() -->nfs_setattr() ... ... -->nfs_invalidate_page() -->__nfs_fscache_invalidate_page() cookie = nfsi->fscache -->nfs_open() -->nfs_fscache_set_inode_cookie() nfs_fscache_inode_lock() nfs_fscache_disable_inode_cookie() -->__fscache_relinquish_cookie() -->__fscache_uncache_page(cookie) <crash> <--__fscache_relinquish_cookie() nfs_inode->fscache = NULL <--nfs_fscache_set_inode_cookie() What is needed is something to prevent process #2 from reacquiring the cookie - and I think checking i_writecount should do the trick. It's also possible to have a two-way race on this if the file is opened O_TRUNC|O_RDONLY instead. Reported-by: Mark Moseley <moseleymark@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-27 17:20:03 +07:00
nfs_fscache_open_file(inode, filp);
return 0;
}
static int nfs_finish_open(struct nfs_open_context *ctx,
struct dentry *dentry,
struct file *file, unsigned open_flags)
{
int err;
err = finish_open(file, dentry, do_open);
if (err)
goto out;
if (S_ISREG(file->f_path.dentry->d_inode->i_mode))
nfs_file_set_open_context(file, ctx);
else
err = -ESTALE;
out:
return err;
}
int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
struct file *file, unsigned open_flags,
umode_t mode)
{
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct nfs_open_context *ctx;
struct dentry *res;
struct iattr attr = { .ia_valid = ATTR_OPEN };
struct inode *inode;
unsigned int lookup_flags = 0;
bool switched = false;
int created = 0;
int err;
/* Expect a negative dentry */
BUG_ON(d_inode(dentry));
dfprintk(VFS, "NFS: atomic_open(%s/%lu), %pd\n",
dir->i_sb->s_id, dir->i_ino, dentry);
err = nfs_check_flags(open_flags);
if (err)
return err;
/* NFS only supports OPEN on regular files */
if ((open_flags & O_DIRECTORY)) {
if (!d_in_lookup(dentry)) {
/*
* Hashed negative dentry with O_DIRECTORY: dentry was
* revalidated and is fine, no need to perform lookup
* again
*/
return -ENOENT;
}
lookup_flags = LOOKUP_OPEN|LOOKUP_DIRECTORY;
goto no_open;
}
if (dentry->d_name.len > NFS_SERVER(dir)->namelen)
return -ENAMETOOLONG;
if (open_flags & O_CREAT) {
struct nfs_server *server = NFS_SERVER(dir);
if (!(server->attr_bitmask[2] & FATTR4_WORD2_MODE_UMASK))
mode &= ~current_umask();
attr.ia_valid |= ATTR_MODE;
attr.ia_mode = mode;
}
if (open_flags & O_TRUNC) {
attr.ia_valid |= ATTR_SIZE;
attr.ia_size = 0;
}
if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {
d_drop(dentry);
switched = true;
dentry = d_alloc_parallel(dentry->d_parent,
&dentry->d_name, &wq);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
if (unlikely(!d_in_lookup(dentry)))
return finish_no_open(file, dentry);
}
2016-10-13 11:26:47 +07:00
ctx = create_nfs_open_context(dentry, open_flags, file);
err = PTR_ERR(ctx);
if (IS_ERR(ctx))
goto out;
trace_nfs_atomic_open_enter(dir, ctx, open_flags);
inode = NFS_PROTO(dir)->open_context(dir, ctx, open_flags, &attr, &created);
if (created)
file->f_mode |= FMODE_CREATED;
if (IS_ERR(inode)) {
err = PTR_ERR(inode);
trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
put_nfs_open_context(ctx);
d_drop(dentry);
switch (err) {
case -ENOENT:
d_splice_alias(NULL, dentry);
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
break;
case -EISDIR:
case -ENOTDIR:
goto no_open;
case -ELOOP:
if (!(open_flags & O_NOFOLLOW))
goto no_open;
break;
/* case -EINVAL: */
default:
break;
}
goto out;
}
err = nfs_finish_open(ctx, ctx->dentry, file, open_flags);
trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
put_nfs_open_context(ctx);
out:
if (unlikely(switched)) {
d_lookup_done(dentry);
dput(dentry);
}
return err;
no_open:
res = nfs_lookup(dir, dentry, lookup_flags);
if (switched) {
d_lookup_done(dentry);
if (!res)
res = dentry;
else
dput(dentry);
}
if (IS_ERR(res))
return PTR_ERR(res);
return finish_no_open(file, res);
}
EXPORT_SYMBOL_GPL(nfs_atomic_open);
static int
nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
struct inode *inode;
if (!(flags & LOOKUP_OPEN) || (flags & LOOKUP_DIRECTORY))
goto full_reval;
if (d_mountpoint(dentry))
goto full_reval;
inode = d_inode(dentry);
/* We can't create new files in nfs_open_revalidate(), so we
* optimize away revalidation of negative dentries.
*/
if (inode == NULL)
goto full_reval;
if (NFS_PROTO(dir)->have_delegation(inode, FMODE_READ))
return nfs_lookup_revalidate_delegated(dir, dentry, inode);
/* NFS only supports OPEN on regular files */
if (!S_ISREG(inode->i_mode))
goto full_reval;
/* We cannot do exclusive creation on a positive dentry */
if (flags & (LOOKUP_EXCL | LOOKUP_REVAL))
goto reval_dentry;
/* Check if the directory changed */
if (!nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU))
goto reval_dentry;
/* Let f_op->open() actually open (and revalidate) the file */
return 1;
reval_dentry:
if (flags & LOOKUP_RCU)
return -ECHILD;
return nfs_lookup_revalidate_dentry(dir, dentry, inode);
full_reval:
return nfs_do_lookup_revalidate(dir, dentry, flags);
}
static int nfs4_lookup_revalidate(struct dentry *dentry, unsigned int flags)
{
return __nfs_lookup_revalidate(dentry, flags,
nfs4_do_lookup_revalidate);
}
#endif /* CONFIG_NFSV4 */
/*
* Code common to create, mkdir, and mknod.
*/
int nfs_instantiate(struct dentry *dentry, struct nfs_fh *fhandle,
struct nfs_fattr *fattr,
struct nfs4_label *label)
{
struct dentry *parent = dget_parent(dentry);
struct inode *dir = d_inode(parent);
struct inode *inode;
struct dentry *d;
int error = -EACCES;
d_drop(dentry);
/* We may have been initialized further down */
if (d_really_is_positive(dentry))
goto out;
if (fhandle->size == 0) {
error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, NULL);
if (error)
goto out_error;
}
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
if (!(fattr->valid & NFS_ATTR_FATTR)) {
struct nfs_server *server = NFS_SB(dentry->d_sb);
error = server->nfs_client->rpc_ops->getattr(server, fhandle,
fattr, NULL, NULL);
if (error < 0)
goto out_error;
}
inode = nfs_fhget(dentry->d_sb, fhandle, fattr, label);
d = d_splice_alias(inode, dentry);
if (IS_ERR(d)) {
error = PTR_ERR(d);
goto out_error;
}
dput(d);
out:
dput(parent);
return 0;
out_error:
nfs_mark_for_revalidate(dir);
dput(parent);
return error;
}
EXPORT_SYMBOL_GPL(nfs_instantiate);
/*
* Following a failed create operation, we drop the dentry rather
* than retain a negative dentry. This avoids a problem in the event
* that the operation succeeded on the server, but an error in the
* reply path made it appear to have failed.
*/
int nfs_create(struct inode *dir, struct dentry *dentry,
umode_t mode, bool excl)
{
struct iattr attr;
int open_flags = excl ? O_CREAT | O_EXCL : O_CREAT;
int error;
dfprintk(VFS, "NFS: create(%s/%lu), %pd\n",
dir->i_sb->s_id, dir->i_ino, dentry);
attr.ia_mode = mode;
attr.ia_valid = ATTR_MODE;
trace_nfs_create_enter(dir, dentry, open_flags);
error = NFS_PROTO(dir)->create(dir, dentry, &attr, open_flags);
trace_nfs_create_exit(dir, dentry, open_flags, error);
if (error != 0)
goto out_err;
return 0;
out_err:
d_drop(dentry);
return error;
}
EXPORT_SYMBOL_GPL(nfs_create);
/*
* See comments for nfs_proc_create regarding failed operations.
*/
int
nfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t rdev)
{
struct iattr attr;
int status;
dfprintk(VFS, "NFS: mknod(%s/%lu), %pd\n",
dir->i_sb->s_id, dir->i_ino, dentry);
attr.ia_mode = mode;
attr.ia_valid = ATTR_MODE;
trace_nfs_mknod_enter(dir, dentry);
status = NFS_PROTO(dir)->mknod(dir, dentry, &attr, rdev);
trace_nfs_mknod_exit(dir, dentry, status);
if (status != 0)
goto out_err;
return 0;
out_err:
d_drop(dentry);
return status;
}
EXPORT_SYMBOL_GPL(nfs_mknod);
/*
* See comments for nfs_proc_create regarding failed operations.
*/
int nfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
{
struct iattr attr;
int error;
dfprintk(VFS, "NFS: mkdir(%s/%lu), %pd\n",
dir->i_sb->s_id, dir->i_ino, dentry);
attr.ia_valid = ATTR_MODE;
attr.ia_mode = mode | S_IFDIR;
trace_nfs_mkdir_enter(dir, dentry);
error = NFS_PROTO(dir)->mkdir(dir, dentry, &attr);
trace_nfs_mkdir_exit(dir, dentry, error);
if (error != 0)
goto out_err;
return 0;
out_err:
d_drop(dentry);
return error;
}
EXPORT_SYMBOL_GPL(nfs_mkdir);
static void nfs_dentry_handle_enoent(struct dentry *dentry)
{
if (simple_positive(dentry))
d_delete(dentry);
}
int nfs_rmdir(struct inode *dir, struct dentry *dentry)
{
int error;
dfprintk(VFS, "NFS: rmdir(%s/%lu), %pd\n",
dir->i_sb->s_id, dir->i_ino, dentry);
trace_nfs_rmdir_enter(dir, dentry);
if (d_really_is_positive(dentry)) {
down_write(&NFS_I(d_inode(dentry))->rmdir_sem);
error = NFS_PROTO(dir)->rmdir(dir, &dentry->d_name);
/* Ensure the VFS deletes this inode */
switch (error) {
case 0:
clear_nlink(d_inode(dentry));
break;
case -ENOENT:
nfs_dentry_handle_enoent(dentry);
}
up_write(&NFS_I(d_inode(dentry))->rmdir_sem);
} else
error = NFS_PROTO(dir)->rmdir(dir, &dentry->d_name);
trace_nfs_rmdir_exit(dir, dentry, error);
return error;
}
EXPORT_SYMBOL_GPL(nfs_rmdir);
/*
* Remove a file after making sure there are no pending writes,
* and after checking that the file has only one user.
*
* We invalidate the attribute cache and free the inode prior to the operation
* to avoid possible races if the server reuses the inode.
*/
static int nfs_safe_remove(struct dentry *dentry)
{
struct inode *dir = d_inode(dentry->d_parent);
struct inode *inode = d_inode(dentry);
int error = -EBUSY;
dfprintk(VFS, "NFS: safe_remove(%pd2)\n", dentry);
/* If the dentry was sillyrenamed, we simply call d_delete() */
if (dentry->d_flags & DCACHE_NFSFS_RENAMED) {
error = 0;
goto out;
}
trace_nfs_remove_enter(dir, dentry);
if (inode != NULL) {
error = NFS_PROTO(dir)->remove(dir, dentry);
if (error == 0)
nfs_drop_nlink(inode);
} else
error = NFS_PROTO(dir)->remove(dir, dentry);
if (error == -ENOENT)
nfs_dentry_handle_enoent(dentry);
trace_nfs_remove_exit(dir, dentry, error);
out:
return error;
}
/* We do silly rename. In case sillyrename() returns -EBUSY, the inode
* belongs to an active ".nfs..." file and we return -EBUSY.
*
* If sillyrename() returns 0, we do nothing, otherwise we unlink.
*/
int nfs_unlink(struct inode *dir, struct dentry *dentry)
{
int error;
int need_rehash = 0;
dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
dir->i_ino, dentry);
trace_nfs_unlink_enter(dir, dentry);
spin_lock(&dentry->d_lock);
if (d_count(dentry) > 1) {
spin_unlock(&dentry->d_lock);
/* Start asynchronous writeout of the inode */
write_inode_now(d_inode(dentry), 0);
error = nfs_sillyrename(dir, dentry);
goto out;
}
if (!d_unhashed(dentry)) {
__d_drop(dentry);
need_rehash = 1;
}
spin_unlock(&dentry->d_lock);
error = nfs_safe_remove(dentry);
if (!error || error == -ENOENT) {
nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
} else if (need_rehash)
d_rehash(dentry);
out:
trace_nfs_unlink_exit(dir, dentry, error);
return error;
}
EXPORT_SYMBOL_GPL(nfs_unlink);
/*
* To create a symbolic link, most file systems instantiate a new inode,
* add a page to it containing the path, then write it out to the disk
* using prepare_write/commit_write.
*
* Unfortunately the NFS client can't create the in-core inode first
* because it needs a file handle to create an in-core inode (see
* fs/nfs/inode.c:nfs_fhget). We only have a file handle *after* the
* symlink request has completed on the server.
*
* So instead we allocate a raw page, copy the symname into it, then do
* the SYMLINK request with the page as the buffer. If it succeeds, we
* now have a new file handle and can instantiate an in-core NFS inode
* and move the raw page into its mapping.
*/
int nfs_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
{
struct page *page;
char *kaddr;
struct iattr attr;
unsigned int pathlen = strlen(symname);
int error;
dfprintk(VFS, "NFS: symlink(%s/%lu, %pd, %s)\n", dir->i_sb->s_id,
dir->i_ino, dentry, symname);
if (pathlen > PAGE_SIZE)
return -ENAMETOOLONG;
attr.ia_mode = S_IFLNK | S_IRWXUGO;
attr.ia_valid = ATTR_MODE;
page = alloc_page(GFP_USER);
if (!page)
return -ENOMEM;
kaddr = page_address(page);
memcpy(kaddr, symname, pathlen);
if (pathlen < PAGE_SIZE)
memset(kaddr + pathlen, 0, PAGE_SIZE - pathlen);
trace_nfs_symlink_enter(dir, dentry);
error = NFS_PROTO(dir)->symlink(dir, dentry, page, pathlen, &attr);
trace_nfs_symlink_exit(dir, dentry, error);
if (error != 0) {
dfprintk(VFS, "NFS: symlink(%s/%lu, %pd, %s) error %d\n",
dir->i_sb->s_id, dir->i_ino,
dentry, symname, error);
d_drop(dentry);
__free_page(page);
return error;
}
/*
* No big deal if we can't add this page to the page cache here.
* READLINK will get the missing page from the server if needed.
*/
if (!add_to_page_cache_lru(page, d_inode(dentry)->i_mapping, 0,
GFP_KERNEL)) {
SetPageUptodate(page);
unlock_page(page);
/*
* add_to_page_cache_lru() grabs an extra page refcount.
* Drop it here to avoid leaking this page later.
*/
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 19:29:47 +07:00
put_page(page);
} else
__free_page(page);
return 0;
}
EXPORT_SYMBOL_GPL(nfs_symlink);
int
nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
{
struct inode *inode = d_inode(old_dentry);
int error;
dfprintk(VFS, "NFS: link(%pd2 -> %pd2)\n",
old_dentry, dentry);
trace_nfs_link_enter(inode, dir, dentry);
d_drop(dentry);
error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
if (error == 0) {
ihold(inode);
d_add(dentry, inode);
}
trace_nfs_link_exit(inode, dir, dentry, error);
return error;
}
EXPORT_SYMBOL_GPL(nfs_link);
/*
* RENAME
* FIXME: Some nfsds, like the Linux user space nfsd, may generate a
* different file handle for the same inode after a rename (e.g. when
* moving to a different directory). A fail-safe method to do so would
* be to look up old_dir/old_name, create a link to new_dir/new_name and
* rename the old file using the sillyrename stuff. This way, the original
* file in old_dir will go away when the last process iput()s the inode.
*
* FIXED.
*
* It actually works quite well. One needs to have the possibility for
* at least one ".nfs..." file in each directory the file ever gets
* moved or linked to which happens automagically with the new
* implementation that only depends on the dcache stuff instead of
* using the inode layer
*
* Unfortunately, things are a little more complicated than indicated
* above. For a cross-directory move, we want to make sure we can get
* rid of the old inode after the operation. This means there must be
* no pending writes (if it's a file), and the use count must be 1.
* If these conditions are met, we can drop the dentries before doing
* the rename.
*/
int nfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry,
unsigned int flags)
{
struct inode *old_inode = d_inode(old_dentry);
struct inode *new_inode = d_inode(new_dentry);
struct dentry *dentry = NULL, *rehash = NULL;
struct rpc_task *task;
int error = -EBUSY;
if (flags)
return -EINVAL;
dfprintk(VFS, "NFS: rename(%pd2 -> %pd2, ct=%d)\n",
old_dentry, new_dentry,
d_count(new_dentry));
trace_nfs_rename_enter(old_dir, old_dentry, new_dir, new_dentry);
/*
* For non-directories, check whether the target is busy and if so,
* make a copy of the dentry and then do a silly-rename. If the
* silly-rename succeeds, the copied dentry is hashed and becomes
* the new target.
*/
if (new_inode && !S_ISDIR(new_inode->i_mode)) {
/*
* To prevent any new references to the target during the
* rename, we unhash the dentry in advance.
*/
if (!d_unhashed(new_dentry)) {
d_drop(new_dentry);
rehash = new_dentry;
}
if (d_count(new_dentry) > 2) {
int err;
/* copy the target dentry's name */
dentry = d_alloc(new_dentry->d_parent,
&new_dentry->d_name);
if (!dentry)
goto out;
/* silly-rename the existing target ... */
err = nfs_sillyrename(new_dir, new_dentry);
if (err)
goto out;
new_dentry = dentry;
rehash = NULL;
new_inode = NULL;
}
}
task = nfs_async_rename(old_dir, new_dir, old_dentry, new_dentry, NULL);
if (IS_ERR(task)) {
error = PTR_ERR(task);
goto out;
}
error = rpc_wait_for_completion_task(task);
if (error != 0) {
((struct nfs_renamedata *)task->tk_calldata)->cancelled = 1;
/* Paired with the atomic_dec_and_test() barrier in rpc_do_put_task() */
smp_wmb();
} else
error = task->tk_status;
rpc_put_task(task);
/* Ensure the inode attributes are revalidated */
if (error == 0) {
spin_lock(&old_inode->i_lock);
NFS_I(old_inode)->attr_gencount = nfs_inc_attr_generation_counter();
NFS_I(old_inode)->cache_validity |= NFS_INO_INVALID_CHANGE
| NFS_INO_INVALID_CTIME
| NFS_INO_REVAL_FORCED;
spin_unlock(&old_inode->i_lock);
}
out:
if (rehash)
d_rehash(rehash);
trace_nfs_rename_exit(old_dir, old_dentry,
new_dir, new_dentry, error);
if (!error) {
if (new_inode != NULL)
nfs_drop_nlink(new_inode);
/*
* The d_move() should be here instead of in an async RPC completion
* handler because we need the proper locks to move the dentry. If
* we're interrupted by a signal, the async RPC completion handler
* should mark the directories for revalidation.
*/
d_move(old_dentry, new_dentry);
nfs_set_verifier(old_dentry,
nfs_save_change_attribute(new_dir));
} else if (error == -ENOENT)
nfs_dentry_handle_enoent(old_dentry);
/* new dentry created? */
if (dentry)
dput(dentry);
return error;
}
EXPORT_SYMBOL_GPL(nfs_rename);
static DEFINE_SPINLOCK(nfs_access_lru_lock);
static LIST_HEAD(nfs_access_lru_list);
static atomic_long_t nfs_access_nr_entries;
static unsigned long nfs_access_max_cachesize = ULONG_MAX;
module_param(nfs_access_max_cachesize, ulong, 0644);
MODULE_PARM_DESC(nfs_access_max_cachesize, "NFS access maximum total cache length");
static void nfs_access_free_entry(struct nfs_access_entry *entry)
{
put_cred(entry->cred);
kfree_rcu(entry, rcu_head);
smp_mb__before_atomic();
atomic_long_dec(&nfs_access_nr_entries);
smp_mb__after_atomic();
}
static void nfs_access_free_list(struct list_head *head)
{
struct nfs_access_entry *cache;
while (!list_empty(head)) {
cache = list_entry(head->next, struct nfs_access_entry, lru);
list_del(&cache->lru);
nfs_access_free_entry(cache);
}
}
static unsigned long
nfs_do_access_cache_scan(unsigned int nr_to_scan)
{
LIST_HEAD(head);
struct nfs_inode *nfsi, *next;
struct nfs_access_entry *cache;
fs: convert fs shrinkers to new scan/count API Convert the filesystem shrinkers to use the new API, and standardise some of the behaviours of the shrinkers at the same time. For example, nr_to_scan means the number of objects to scan, not the number of objects to free. I refactored the CIFS idmap shrinker a little - it really needs to be broken up into a shrinker per tree and keep an item count with the tree root so that we don't need to walk the tree every time the shrinker needs to count the number of objects in the tree (i.e. all the time under memory pressure). [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree] [assorted fixes folded in] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 07:18:09 +07:00
long freed = 0;
spin_lock(&nfs_access_lru_lock);
list_for_each_entry_safe(nfsi, next, &nfs_access_lru_list, access_cache_inode_lru) {
struct inode *inode;
if (nr_to_scan-- == 0)
break;
inode = &nfsi->vfs_inode;
spin_lock(&inode->i_lock);
if (list_empty(&nfsi->access_cache_entry_lru))
goto remove_lru_entry;
cache = list_entry(nfsi->access_cache_entry_lru.next,
struct nfs_access_entry, lru);
list_move(&cache->lru, &head);
rb_erase(&cache->rb_node, &nfsi->access_cache);
fs: convert fs shrinkers to new scan/count API Convert the filesystem shrinkers to use the new API, and standardise some of the behaviours of the shrinkers at the same time. For example, nr_to_scan means the number of objects to scan, not the number of objects to free. I refactored the CIFS idmap shrinker a little - it really needs to be broken up into a shrinker per tree and keep an item count with the tree root so that we don't need to walk the tree every time the shrinker needs to count the number of objects in the tree (i.e. all the time under memory pressure). [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree] [assorted fixes folded in] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 07:18:09 +07:00
freed++;
if (!list_empty(&nfsi->access_cache_entry_lru))
list_move_tail(&nfsi->access_cache_inode_lru,
&nfs_access_lru_list);
else {
remove_lru_entry:
list_del_init(&nfsi->access_cache_inode_lru);
smp_mb__before_atomic();
clear_bit(NFS_INO_ACL_LRU_SET, &nfsi->flags);
smp_mb__after_atomic();
}
spin_unlock(&inode->i_lock);
}
spin_unlock(&nfs_access_lru_lock);
nfs_access_free_list(&head);
fs: convert fs shrinkers to new scan/count API Convert the filesystem shrinkers to use the new API, and standardise some of the behaviours of the shrinkers at the same time. For example, nr_to_scan means the number of objects to scan, not the number of objects to free. I refactored the CIFS idmap shrinker a little - it really needs to be broken up into a shrinker per tree and keep an item count with the tree root so that we don't need to walk the tree every time the shrinker needs to count the number of objects in the tree (i.e. all the time under memory pressure). [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree] [assorted fixes folded in] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 07:18:09 +07:00
return freed;
}
unsigned long
nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
{
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
return SHRINK_STOP;
return nfs_do_access_cache_scan(nr_to_scan);
}
fs: convert fs shrinkers to new scan/count API Convert the filesystem shrinkers to use the new API, and standardise some of the behaviours of the shrinkers at the same time. For example, nr_to_scan means the number of objects to scan, not the number of objects to free. I refactored the CIFS idmap shrinker a little - it really needs to be broken up into a shrinker per tree and keep an item count with the tree root so that we don't need to walk the tree every time the shrinker needs to count the number of objects in the tree (i.e. all the time under memory pressure). [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree] [assorted fixes folded in] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 07:18:09 +07:00
unsigned long
nfs_access_cache_count(struct shrinker *shrink, struct shrink_control *sc)
{
super: fix calculation of shrinkable objects for small numbers The sysctl knob sysctl_vfs_cache_pressure is used to determine which percentage of the shrinkable objects in our cache we should actively try to shrink. It works great in situations in which we have many objects (at least more than 100), because the aproximation errors will be negligible. But if this is not the case, specially when total_objects < 100, we may end up concluding that we have no objects at all (total / 100 = 0, if total < 100). This is certainly not the biggest killer in the world, but may matter in very low kernel memory situations. Signed-off-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 07:17:53 +07:00
return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries));
}
static void
nfs_access_cache_enforce_limit(void)
{
long nr_entries = atomic_long_read(&nfs_access_nr_entries);
unsigned long diff;
unsigned int nr_to_scan;
if (nr_entries < 0 || nr_entries <= nfs_access_max_cachesize)
return;
nr_to_scan = 100;
diff = nr_entries - nfs_access_max_cachesize;
if (diff < nr_to_scan)
nr_to_scan = diff;
nfs_do_access_cache_scan(nr_to_scan);
}
static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
{
struct rb_root *root_node = &nfsi->access_cache;
struct rb_node *n;
struct nfs_access_entry *entry;
/* Unhook entries from the cache */
while ((n = rb_first(root_node)) != NULL) {
entry = rb_entry(n, struct nfs_access_entry, rb_node);
rb_erase(n, root_node);
list_move(&entry->lru, head);
}
nfsi->cache_validity &= ~NFS_INO_INVALID_ACCESS;
}
void nfs_access_zap_cache(struct inode *inode)
{
LIST_HEAD(head);
if (test_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags) == 0)
return;
/* Remove from global LRU init */
spin_lock(&nfs_access_lru_lock);
if (test_and_clear_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags))
list_del_init(&NFS_I(inode)->access_cache_inode_lru);
spin_lock(&inode->i_lock);
__nfs_access_zap_cache(NFS_I(inode), &head);
spin_unlock(&inode->i_lock);
spin_unlock(&nfs_access_lru_lock);
nfs_access_free_list(&head);
}
EXPORT_SYMBOL_GPL(nfs_access_zap_cache);
static struct nfs_access_entry *nfs_access_search_rbtree(struct inode *inode, const struct cred *cred)
{
struct rb_node *n = NFS_I(inode)->access_cache.rb_node;
while (n != NULL) {
struct nfs_access_entry *entry =
rb_entry(n, struct nfs_access_entry, rb_node);
int cmp = cred_fscmp(cred, entry->cred);
if (cmp < 0)
n = n->rb_left;
else if (cmp > 0)
n = n->rb_right;
else
return entry;
}
return NULL;
}
static int nfs_access_get_cached(struct inode *inode, const struct cred *cred, struct nfs_access_entry *res, bool may_block)
{
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_access_entry *cache;
bool retry = true;
int err;
spin_lock(&inode->i_lock);
for(;;) {
if (nfsi->cache_validity & NFS_INO_INVALID_ACCESS)
goto out_zap;
cache = nfs_access_search_rbtree(inode, cred);
err = -ENOENT;
if (cache == NULL)
goto out;
/* Found an entry, is our attribute cache valid? */
if (!nfs_check_cache_invalid(inode, NFS_INO_INVALID_ACCESS))
break;
err = -ECHILD;
if (!may_block)
goto out;
if (!retry)
goto out_zap;
spin_unlock(&inode->i_lock);
err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (err)
return err;
spin_lock(&inode->i_lock);
retry = false;
}
res->cred = cache->cred;
res->mask = cache->mask;
list_move_tail(&cache->lru, &nfsi->access_cache_entry_lru);
err = 0;
out:
spin_unlock(&inode->i_lock);
return err;
out_zap:
spin_unlock(&inode->i_lock);
nfs_access_zap_cache(inode);
return -ENOENT;
}
static int nfs_access_get_cached_rcu(struct inode *inode, const struct cred *cred, struct nfs_access_entry *res)
{
/* Only check the most recently returned cache entry,
* but do it without locking.
*/
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_access_entry *cache;
int err = -ECHILD;
struct list_head *lh;
rcu_read_lock();
if (nfsi->cache_validity & NFS_INO_INVALID_ACCESS)
goto out;
lh = rcu_dereference(nfsi->access_cache_entry_lru.prev);
cache = list_entry(lh, struct nfs_access_entry, lru);
if (lh == &nfsi->access_cache_entry_lru ||
cred != cache->cred)
cache = NULL;
if (cache == NULL)
goto out;
if (nfs_check_cache_invalid(inode, NFS_INO_INVALID_ACCESS))
goto out;
res->cred = cache->cred;
res->mask = cache->mask;
err = 0;
out:
rcu_read_unlock();
return err;
}
static void nfs_access_add_rbtree(struct inode *inode, struct nfs_access_entry *set)
{
struct nfs_inode *nfsi = NFS_I(inode);
struct rb_root *root_node = &nfsi->access_cache;
struct rb_node **p = &root_node->rb_node;
struct rb_node *parent = NULL;
struct nfs_access_entry *entry;
int cmp;
spin_lock(&inode->i_lock);
while (*p != NULL) {
parent = *p;
entry = rb_entry(parent, struct nfs_access_entry, rb_node);
cmp = cred_fscmp(set->cred, entry->cred);
if (cmp < 0)
p = &parent->rb_left;
else if (cmp > 0)
p = &parent->rb_right;
else
goto found;
}
rb_link_node(&set->rb_node, parent, p);
rb_insert_color(&set->rb_node, root_node);
list_add_tail(&set->lru, &nfsi->access_cache_entry_lru);
spin_unlock(&inode->i_lock);
return;
found:
rb_replace_node(parent, &set->rb_node, root_node);
list_add_tail(&set->lru, &nfsi->access_cache_entry_lru);
list_del(&entry->lru);
spin_unlock(&inode->i_lock);
nfs_access_free_entry(entry);
}
void nfs_access_add_cache(struct inode *inode, struct nfs_access_entry *set)
{
struct nfs_access_entry *cache = kmalloc(sizeof(*cache), GFP_KERNEL);
if (cache == NULL)
return;
RB_CLEAR_NODE(&cache->rb_node);
cache->cred = get_cred(set->cred);
cache->mask = set->mask;
/* The above field assignments must be visible
* before this item appears on the lru. We cannot easily
* use rcu_assign_pointer, so just force the memory barrier.
*/
smp_wmb();
nfs_access_add_rbtree(inode, cache);
/* Update accounting */
smp_mb__before_atomic();
atomic_long_inc(&nfs_access_nr_entries);
smp_mb__after_atomic();
/* Add inode to global LRU list */
if (!test_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags)) {
spin_lock(&nfs_access_lru_lock);
if (!test_and_set_bit(NFS_INO_ACL_LRU_SET, &NFS_I(inode)->flags))
list_add_tail(&NFS_I(inode)->access_cache_inode_lru,
&nfs_access_lru_list);
spin_unlock(&nfs_access_lru_lock);
}
nfs_access_cache_enforce_limit();
}
EXPORT_SYMBOL_GPL(nfs_access_add_cache);
#define NFS_MAY_READ (NFS_ACCESS_READ)
#define NFS_MAY_WRITE (NFS_ACCESS_MODIFY | \
NFS_ACCESS_EXTEND | \
NFS_ACCESS_DELETE)
#define NFS_FILE_MAY_WRITE (NFS_ACCESS_MODIFY | \
NFS_ACCESS_EXTEND)
#define NFS_DIR_MAY_WRITE NFS_MAY_WRITE
#define NFS_MAY_LOOKUP (NFS_ACCESS_LOOKUP)
#define NFS_MAY_EXECUTE (NFS_ACCESS_EXECUTE)
static int
nfs_access_calc_mask(u32 access_result, umode_t umode)
{
int mask = 0;
if (access_result & NFS_MAY_READ)
mask |= MAY_READ;
if (S_ISDIR(umode)) {
if ((access_result & NFS_DIR_MAY_WRITE) == NFS_DIR_MAY_WRITE)
mask |= MAY_WRITE;
if ((access_result & NFS_MAY_LOOKUP) == NFS_MAY_LOOKUP)
mask |= MAY_EXEC;
} else if (S_ISREG(umode)) {
if ((access_result & NFS_FILE_MAY_WRITE) == NFS_FILE_MAY_WRITE)
mask |= MAY_WRITE;
if ((access_result & NFS_MAY_EXECUTE) == NFS_MAY_EXECUTE)
mask |= MAY_EXEC;
} else if (access_result & NFS_MAY_WRITE)
mask |= MAY_WRITE;
return mask;
}
void nfs_access_set_mask(struct nfs_access_entry *entry, u32 access_result)
{
entry->mask = access_result;
}
EXPORT_SYMBOL_GPL(nfs_access_set_mask);
static int nfs_do_access(struct inode *inode, const struct cred *cred, int mask)
{
struct nfs_access_entry cache;
bool may_block = (mask & MAY_NOT_BLOCK) == 0;
int cache_mask;
int status;
trace_nfs_access_enter(inode);
status = nfs_access_get_cached_rcu(inode, cred, &cache);
if (status != 0)
status = nfs_access_get_cached(inode, cred, &cache, may_block);
if (status == 0)
goto out_cached;
status = -ECHILD;
if (!may_block)
goto out;
/*
* Determine which access bits we want to ask for...
*/
cache.mask = NFS_ACCESS_READ | NFS_ACCESS_MODIFY | NFS_ACCESS_EXTEND;
if (S_ISDIR(inode->i_mode))
cache.mask |= NFS_ACCESS_DELETE | NFS_ACCESS_LOOKUP;
else
cache.mask |= NFS_ACCESS_EXECUTE;
cache.cred = cred;
status = NFS_PROTO(inode)->access(inode, &cache);
NFS: Handle -ESTALE error in access() Hi Trond, I have been looking at a bugreport where trying to open applications on KDE on a NFS mounted home fails temporarily. There have been multiple reports on different kernel versions pointing to this common issue: http://bugzilla.kernel.org/show_bug.cgi?id=12557 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html This issue can be reproducible consistently by doing this on a NFS mounted home (KDE): 1. Open 2 xterm sessions 2. From one of the xterm session, do "ssh -X <remote host>" 3. "stat ~/.Xauthority" on the remote SSH session 4. Close the two xterm sessions 5. On the server do a "stat ~/.Xauthority" 6. Now on the client, try to open xterm This will fail. Even if the filehandle had become stale, the NFS client should invalidate the cache/inode and should repeat LOOKUP. Looking at the packet capture when the failure occurs shows that there were two subsequent ACCESS() calls with the same filehandle and both fails with -ESTALE error. I have tested the fix below. Now the client issue a LOOKUP after the ACCESS() call fails with -ESTALE. If all this makes sense to you, can you consider this for inclusion? Thanks, If the server returns an -ESTALE error due to stale filehandle in response to an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP() can be retried. Without this change, the nfs client retries ACCESS() with the same filehandle, fails again and could lead to temporary failure of applications running on nfs mounted home. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 07:33:21 +07:00
if (status != 0) {
if (status == -ESTALE) {
nfs_zap_caches(inode);
if (!S_ISDIR(inode->i_mode))
set_bit(NFS_INO_STALE, &NFS_I(inode)->flags);
}
goto out;
NFS: Handle -ESTALE error in access() Hi Trond, I have been looking at a bugreport where trying to open applications on KDE on a NFS mounted home fails temporarily. There have been multiple reports on different kernel versions pointing to this common issue: http://bugzilla.kernel.org/show_bug.cgi?id=12557 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html This issue can be reproducible consistently by doing this on a NFS mounted home (KDE): 1. Open 2 xterm sessions 2. From one of the xterm session, do "ssh -X <remote host>" 3. "stat ~/.Xauthority" on the remote SSH session 4. Close the two xterm sessions 5. On the server do a "stat ~/.Xauthority" 6. Now on the client, try to open xterm This will fail. Even if the filehandle had become stale, the NFS client should invalidate the cache/inode and should repeat LOOKUP. Looking at the packet capture when the failure occurs shows that there were two subsequent ACCESS() calls with the same filehandle and both fails with -ESTALE error. I have tested the fix below. Now the client issue a LOOKUP after the ACCESS() call fails with -ESTALE. If all this makes sense to you, can you consider this for inclusion? Thanks, If the server returns an -ESTALE error due to stale filehandle in response to an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP() can be retried. Without this change, the nfs client retries ACCESS() with the same filehandle, fails again and could lead to temporary failure of applications running on nfs mounted home. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 07:33:21 +07:00
}
nfs_access_add_cache(inode, &cache);
out_cached:
cache_mask = nfs_access_calc_mask(cache.mask, inode->i_mode);
if ((mask & ~cache_mask & (MAY_READ | MAY_WRITE | MAY_EXEC)) != 0)
status = -EACCES;
out:
trace_nfs_access_exit(inode, status);
return status;
}
static int nfs_open_permission_mask(int openflags)
{
int mask = 0;
if (openflags & __FMODE_EXEC) {
/* ONLY check exec rights */
mask = MAY_EXEC;
} else {
if ((openflags & O_ACCMODE) != O_WRONLY)
mask |= MAY_READ;
if ((openflags & O_ACCMODE) != O_RDONLY)
mask |= MAY_WRITE;
}
return mask;
}
int nfs_may_open(struct inode *inode, const struct cred *cred, int openflags)
{
return nfs_do_access(inode, cred, nfs_open_permission_mask(openflags));
}
EXPORT_SYMBOL_GPL(nfs_may_open);
static int nfs_execute_ok(struct inode *inode, int mask)
{
struct nfs_server *server = NFS_SERVER(inode);
int ret = 0;
if (S_ISDIR(inode->i_mode))
return 0;
if (nfs_check_cache_invalid(inode, NFS_INO_INVALID_OTHER)) {
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
ret = __nfs_revalidate_inode(server, inode);
}
if (ret == 0 && !execute_ok(inode))
ret = -EACCES;
return ret;
}
int nfs_permission(struct inode *inode, int mask)
{
const struct cred *cred = current_cred();
int res = 0;
nfs_inc_stats(inode, NFSIOS_VFSACCESS);
if ((mask & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
goto out;
/* Is this sys_access() ? */
if (mask & (MAY_ACCESS | MAY_CHDIR))
goto force_lookup;
switch (inode->i_mode & S_IFMT) {
case S_IFLNK:
goto out;
case S_IFREG:
NFSv4: Don't perform cached access checks before we've OPENed the file Donald Buczek reports that a nfs4 client incorrectly denies execute access based on outdated file mode (missing 'x' bit). After the mode on the server is 'fixed' (chmod +x) further execution attempts continue to fail, because the nfs ACCESS call updates the access parameter but not the mode parameter or the mode in the inode. The root cause is ultimately that the VFS is calling may_open() before the NFS client has a chance to OPEN the file and hence revalidate the access and attribute caches. Al Viro suggests: >>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know >>> that things will be caught by server anyway? >> >> That can work as long as we're guaranteed that everything that calls >> inode_permission() with MAY_OPEN on a regular file will also follow up >> with a vfs_open() or dentry_open() on success. Is this always the >> case? > > 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since > it doesn't have ->tmpfile() instance anyway) > > 2) in atomic_open(), after the call of ->atomic_open() has succeeded. > > 3) in do_last(), followed on success by vfs_open() > > That's all. All calls of inode_permission() that get MAY_OPEN come from > may_open(), and there's no other callers of that puppy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771 Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-27 09:54:58 +07:00
if ((mask & MAY_OPEN) &&
nfs_server_capable(inode, NFS_CAP_ATOMIC_OPEN))
return 0;
break;
case S_IFDIR:
/*
* Optimize away all write operations, since the server
* will check permissions when we perform the op.
*/
if ((mask & MAY_WRITE) && !(mask & MAY_READ))
goto out;
}
force_lookup:
if (!NFS_PROTO(inode)->access)
goto out_notsup;
/* Always try fast lookups first */
rcu_read_lock();
res = nfs_do_access(inode, cred, mask|MAY_NOT_BLOCK);
rcu_read_unlock();
if (res == -ECHILD && !(mask & MAY_NOT_BLOCK)) {
/* Fast lookup failed, try the slow way */
res = nfs_do_access(inode, cred, mask);
}
out:
if (!res && (mask & MAY_EXEC))
res = nfs_execute_ok(inode, mask);
dfprintk(VFS, "NFS: permission(%s/%lu), mask=0x%x, res=%d\n",
inode->i_sb->s_id, inode->i_ino, mask, res);
return res;
out_notsup:
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
res = nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (res == 0)
res = generic_permission(inode, mask);
goto out;
}
EXPORT_SYMBOL_GPL(nfs_permission);
/*
* Local variables:
* version-control: t
* kept-new-versions: 5
* End:
*/