License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX license identifier to apply to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           270
GPL-2.0+ WITH Linux-syscall-note                          169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
LGPL-2.1+ WITH Linux-syscall-note                          15
GPL-1.0+ WITH Linux-syscall-note                           14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
LGPL-2.0+ WITH Linux-syscall-note                           4
LGPL-2.1 WITH Linux-syscall-note                            3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based in part on an older version of FOSSology, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 21:07:57 +07:00
/* SPDX-License-Identifier: GPL-2.0 */
2005-04-17 05:20:36 +07:00
#ifndef _LINUX_FS_H
#define _LINUX_FS_H

#include <linux/linkage.h>
2017-06-20 17:19:09 +07:00
#include <linux/wait_bit.h>
2005-04-17 05:20:36 +07:00
#include <linux/kdev_t.h>
#include <linux/dcache.h>
2008-07-26 14:46:43 +07:00
#include <linux/path.h>
2005-04-17 05:20:36 +07:00
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/list.h>
2013-08-28 07:17:58 +07:00
#include <linux/list_lru.h>
2013-07-09 04:24:16 +07:00
#include <linux/llist.h>
2005-04-17 05:20:36 +07:00
#include <linux/radix-tree.h>
2018-04-11 06:36:56 +07:00
#include <linux/xarray.h>
2012-10-09 06:31:25 +07:00
#include <linux/rbtree.h>
2005-04-17 05:20:36 +07:00
#include <linux/init.h>
2006-10-19 00:55:46 +07:00
#include <linux/pid.h>
2011-11-24 08:12:59 +07:00
#include <linux/bug.h>
2006-01-10 06:59:24 +07:00
#include <linux/mutex.h>
2014-12-13 07:54:24 +07:00
#include <linux/rwsem.h>
2017-07-11 05:48:25 +07:00
#include <linux/mm_types.h>
2007-07-17 16:30:08 +07:00
#include <linux/capability.h>
2008-04-19 09:21:05 +07:00
#include <linux/semaphore.h>
fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET  No hint information set
RWH_WRITE_LIFE_NONE     No hints about write life time
RWH_WRITE_LIFE_SHORT    Data written has a short life time
RWH_WRITE_LIFE_MEDIUM   Data written has a medium life time
RWH_WRITE_LIFE_LONG     Data written has a long life time
RWH_WRITE_LIFE_EXTREME  Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT       Returns the read/write hint set on the
                    underlying inode.
F_SET_RW_HINT       Set one of the above write hints on the
                    underlying inode.
F_GET_FILE_RW_HINT  Returns the read/write hint set on the
                    file descriptor.
F_SET_FILE_RW_HINT  Set one of the above write hints on the
                    file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode have a write hint, then we
use the one in the file. The file hint can be used for sync/direct
IO; for buffered writeback only the inode hint is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
 * writehint.c: get or set an inode write hint
 */
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>

#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif

/* Indexed by hint value, in the order the hints are listed above. */
static char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	/* An optional second argument sets the hint before it is read back. */
	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	/* Guard the table lookup against values outside the known range. */
	if (hint >= sizeof(str) / sizeof(str[0])) {
		fprintf(stderr, "%s: unknown hint %" PRIu64 "\n", argv[1], hint);
		return 5;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 00:47:04 +07:00
#include <linux/fcntl.h>
2011-01-07 13:50:05 +07:00
#include <linux/rculist_bl.h>
2011-06-20 21:52:57 +07:00
#include <linux/atomic.h>
2011-12-09 05:33:54 +07:00
#include <linux/shrinker.h>
2012-01-24 07:41:32 +07:00
#include <linux/migrate_mode.h>
2012-02-08 22:07:50 +07:00
#include <linux/uidgid.h>
2012-06-12 21:20:34 +07:00
#include <linux/lockdep.h>
2012-09-27 14:35:03 +07:00
#include <linux/percpu-rwsem.h>
2015-07-23 01:21:13 +07:00
#include <linux/workqueue.h>
2015-12-30 03:58:39 +07:00
#include <linux/delayed_call.h>
2017-05-10 20:06:33 +07:00
#include <linux/uuid.h>
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_errors infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 18:02:25 +07:00
#include <linux/errseq.h>
2018-05-23 00:52:19 +07:00
#include <linux/ioprio.h>
2019-01-21 07:54:27 +07:00
#include <linux/fs_types.h>
2019-03-08 07:27:07 +07:00
#include <linux/build_bug.h>
#include <linux/stddef.h>
2005-04-17 05:20:36 +07:00

#include <asm/byteorder.h>
2012-10-13 16:46:48 +07:00
#include <uapi/linux/fs.h>
2005-04-17 05:20:36 +07:00

2015-01-14 16:42:37 +07:00
struct backing_dev_info;
2015-05-23 04:13:37 +07:00
struct bdi_writeback;
2016-11-01 20:40:13 +07:00
struct bio;
2007-07-17 18:04:28 +07:00
struct export_operations;
2020-05-23 14:30:11 +07:00
struct fiemap_extent_info;
2006-01-08 16:02:50 +07:00
struct hd_geometry;
2005-04-17 05:20:36 +07:00
struct iovec;
2005-06-24 12:00:59 +07:00
struct kiocb;
2011-01-10 13:18:25 +07:00
struct kobject;
2005-04-17 05:20:36 +07:00
struct pipe_inode_info;
struct poll_table_struct;
struct kstatfs;
struct vm_area_struct;
struct vfsmount;
2008-11-14 06:39:22 +07:00
struct cred;
2012-08-01 06:44:57 +07:00
struct swap_info_struct;
2012-12-18 07:04:55 +07:00
struct seq_file;
2013-09-04 20:04:39 +07:00
struct workqueue_struct;
2014-02-12 09:34:08 +07:00
struct iov_iter;
2015-05-16 06:26:10 +07:00
struct fscrypt_info;
struct fscrypt_operations;
2019-07-22 23:26:21 +07:00
struct fsverity_info;
struct fsverity_operations;
2018-12-24 06:55:56 +07:00
struct fs_context;
2019-09-07 18:23:15 +07:00
struct fs_parameter_spec;
2005-04-17 05:20:36 +07:00

2007-10-17 13:26:30 +07:00
extern void __init inode_init(void);
2005-04-17 05:20:36 +07:00
extern void __init inode_init_early(void);
2015-08-07 05:46:20 +07:00
extern void __init files_init(void);
extern void __init files_maxfiles_init(void);
2005-04-17 05:20:36 +07:00

2008-12-02 05:34:50 +07:00
extern struct files_stat_struct files_stat;
2010-10-27 04:22:44 +07:00
extern unsigned long get_max_files(void);
2016-09-02 04:38:52 +07:00
extern unsigned int sysctl_nr_open;
2008-12-02 05:34:50 +07:00
extern struct inodes_stat_t inodes_stat;
extern int leases_enable, lease_break_time;
fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
  - POSIX didn't consider this situation and it's not useful to follow
    a broken specification at the cost of security.
- Might break unknown applications that use this feature.
  - Applications that break because of the change are easy to spot and
    fix. Applications that are vulnerable to symlink ToCToU by not having
    the change aren't. Additionally, no applications have yet been found
    that rely on this behavior.
- Applications should just use mkstemp() or O_CREAT|O_EXCL.
  - True, but applications are not perfect, and new software is written
    all the time that makes these mistakes; blocking this flaw at the
    kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
  - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
  - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-26 07:29:07 +07:00
extern int sysctl_protected_symlinks;
extern int sysctl_protected_hardlinks;
2018-08-24 07:00:35 +07:00
extern int sysctl_protected_fifos;
extern int sysctl_protected_regular;
2008-12-02 05:34:50 +07:00

2017-07-06 23:58:37 +07:00
typedef __kernel_rwf_t rwf_t;

2005-04-17 05:20:36 +07:00
struct buffer_head;
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create);
2016-02-08 10:40:51 +07:00
typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
2013-09-04 20:04:39 +07:00
			ssize_t bytes, void *private);
2005-04-17 05:20:36 +07:00

2012-10-15 22:40:35 +07:00
#define MAY_EXEC 0x00000001
#define MAY_WRITE 0x00000002
#define MAY_READ 0x00000004
#define MAY_APPEND 0x00000008
#define MAY_ACCESS 0x00000010
#define MAY_OPEN 0x00000020
#define MAY_CHDIR 0x00000040
/* called from RCU mode, don't block */
#define MAY_NOT_BLOCK 0x00000080

/*
 * flags in file.f_mode. Note that FMODE_READ and FMODE_WRITE must correspond
2018-05-18 09:01:03 +07:00
 * to O_WRONLY and O_RDWR via the strange trick in do_dentry_open()
2012-10-15 22:40:35 +07:00
 */

/* file is open for reading */
#define FMODE_READ ((__force fmode_t)0x1)
/* file is open for writing */
#define FMODE_WRITE ((__force fmode_t)0x2)
/* file is seekable */
#define FMODE_LSEEK ((__force fmode_t)0x4)
/* file can be accessed using pread */
#define FMODE_PREAD ((__force fmode_t)0x8)
/* file can be accessed using pwrite */
#define FMODE_PWRITE ((__force fmode_t)0x10)
/* File is opened for execution with sys_execve / sys_uselib */
#define FMODE_EXEC ((__force fmode_t)0x20)
/* File is opened with O_NDELAY (only set for block devices) */
#define FMODE_NDELAY ((__force fmode_t)0x40)
/* File is opened with O_EXCL (only set for block devices) */
#define FMODE_EXCL ((__force fmode_t)0x80)
/* File is opened using open(.., 3, ..) and is writeable only for ioctls
   (special hack for floppy.c) */
#define FMODE_WRITE_IOCTL ((__force fmode_t)0x100)
/* 32bit hashes as llseek() offset (for directories) */
#define FMODE_32BITHASH ((__force fmode_t)0x200)
/* 64bit hashes as llseek() offset (for directories) */
#define FMODE_64BITHASH ((__force fmode_t)0x400)

/*
 * Don't update ctime and mtime.
 *
 * Currently a special hack for the XFS open_by_handle ioctl, but we'll
 * hopefully graduate it to a proper O_CMTIME flag supported by open(2) soon.
 */
#define FMODE_NOCMTIME ((__force fmode_t)0x800)

/* Expect random access pattern */
#define FMODE_RANDOM ((__force fmode_t)0x1000)

/* File is huge (eg. /dev/kmem): treat loff_t as unsigned */
#define FMODE_UNSIGNED_OFFSET ((__force fmode_t)0x2000)

/* File is opened with O_PATH; almost nothing can be done with it */
#define FMODE_PATH ((__force fmode_t)0x4000)

Revert "vfs: properly and reliably lock f_pos in fdget_pos()"
This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.
I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.
While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads. Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service, in both cases apparently leading to timeouts
due to waiting for the f_pos lock.
There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor. Which doesn't work well for
them.
The most obvious examples of this are /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side). It may be that it's just this that causes problems, but we
clearly weren't ready yet.
Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:
/dev/fuse (fuse_dev_operations)
/dev/mcelog (mce_chrdev_ops)
/dev/mei0 (mei_fops)
/dev/net/tun (tun_fops)
/dev/nvme0 (nvme_dev_fops)
/dev/tpm0 (tpm_fops)
/proc/self/ns/mnt (ns_file_operations)
/dev/snd/pcm* (snd_pcm_f_ops[])
and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.
Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.
We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-27 02:34:06 +07:00
/* File needs atomic accesses to f_pos */
#define FMODE_ATOMIC_POS ((__force fmode_t)0x8000)
2014-03-14 23:02:47 +07:00
/* Write access to underlying fs */
#define FMODE_WRITER ((__force fmode_t)0x10000)
2014-02-12 05:49:24 +07:00
/* Has read method(s) */
#define FMODE_CAN_READ ((__force fmode_t)0x20000)
/* Has write method(s) */
#define FMODE_CAN_WRITE ((__force fmode_t)0x40000)
2014-03-04 00:36:58 +07:00

2018-07-09 13:35:08 +07:00
#define FMODE_OPENED ((__force fmode_t)0x80000)
2018-06-09 00:22:02 +07:00
#define FMODE_CREATED ((__force fmode_t)0x100000)
2018-07-09 13:35:08 +07:00

fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.
This caused a regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such a regression for the particular
case of /proc/xen/xenbus.
The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.
However, even though the 2006 version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655e3 -
is doing so regardless of whether a file is seekable or not.
See
https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
https://lwn.net/Articles/180387
https://lwn.net/Articles/180396
for historic context.
The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:
kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...
Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.
Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):
drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:
See
https://github.com/libfuse/osspd
https://lwn.net/Articles/308445
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:
https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
Let's fix this regression. The plan is:
1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.
2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.
3. With a semantic patch, search for and convert to stream_open all
in-kernel nonseekable_open users for which read and write actually do
not depend on ppos and where there are no other methods in
file_operations which assume @offset access.
4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
stream_open if that bit is present in filesystem open reply.
It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so such a change would break a real user.
5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).
This will allow patching OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.
This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.
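A minimal userspace sketch of the effect described above, assuming stream_open only needs to adjust the mode bits of the opened file. The struct, the names stream_open_model and takes_f_pos_lock, and every flag value except FMODE_STREAM (which matches the header) are illustrative assumptions, not the kernel's definitions:

```c
/* Stand-ins for kernel f_mode bits. Only the FMODE_STREAM value is taken
 * from the header; the other values are assumptions for illustration. */
#define FMODE_LSEEK		0x4u
#define FMODE_PREAD		0x8u
#define FMODE_PWRITE		0x10u
#define FMODE_ATOMIC_POS	0x8000u
#define FMODE_STREAM		0x200000u

/* Hypothetical, reduced model of struct file: only the mode bits. */
struct file_model {
	unsigned int f_mode;
};

/* Model of opening a stream-like file: drop seeking, positional I/O and
 * f_pos locking, then mark the file as a stream. */
static int stream_open_model(struct file_model *filp)
{
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
			  FMODE_ATOMIC_POS);
	filp->f_mode |= FMODE_STREAM;
	return 0;
}

/* In this model, read and write serialize on f_pos only while
 * FMODE_ATOMIC_POS is set. */
static int takes_f_pos_lock(const struct file_model *filp)
{
	return (filp->f_mode & FMODE_ATOMIC_POS) != 0;
}
```

With FMODE_ATOMIC_POS cleared, a blocked read in this model no longer holds anything that a concurrent write would wait for, which is the property the plan relies on.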
Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.
The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-27 05:20:43 +07:00
/* File is stream-like */
#define FMODE_STREAM ((__force fmode_t)0x200000)

2012-10-15 22:40:35 +07:00
/* File was opened by fanotify and shouldn't generate fanotify events */
2015-01-09 05:32:29 +07:00
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)
2012-10-15 22:40:35 +07:00

2017-08-29 21:13:20 +07:00
/* File is capable of returning -EAGAIN if I/O will block */
2018-11-06 00:40:30 +07:00
#define FMODE_NOWAIT ((__force fmode_t)0x8000000)

/* File represents mount that needs unmounting */
#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)
2017-06-20 19:05:43 +07:00

2018-07-18 20:44:40 +07:00
/* File does not contribute to nr_files count */
2018-11-06 00:40:30 +07:00
#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
2018-07-18 20:44:40 +07:00

2020-05-22 22:12:51 +07:00
/* File supports async buffered reads */
#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)

2012-10-15 22:40:35 +07:00
/*
 * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
 * that indicates that they should check the contents of the iovec are
 * valid, but not check the memory that the iovec elements
 * points too.
 */
#define CHECK_IOVEC_ONLY -1

2005-04-17 05:20:36 +07:00
/*
 * Attribute flags. These should be or-ed together to figure out what
 * has been changed!
 */
2008-07-01 20:01:26 +07:00
#define ATTR_MODE (1 << 0)
#define ATTR_UID (1 << 1)
#define ATTR_GID (1 << 2)
#define ATTR_SIZE (1 << 3)
#define ATTR_ATIME (1 << 4)
#define ATTR_MTIME (1 << 5)
#define ATTR_CTIME (1 << 6)
#define ATTR_ATIME_SET (1 << 7)
#define ATTR_MTIME_SET (1 << 8)
#define ATTR_FORCE (1 << 9) /* Not a change, but a change it */
#define ATTR_KILL_SUID (1 << 11)
#define ATTR_KILL_SGID (1 << 12)
#define ATTR_FILE (1 << 13)
#define ATTR_KILL_PRIV (1 << 14)
#define ATTR_OPEN (1 << 15) /* Truncating from open(O_TRUNC) */
#define ATTR_TIMES_SET (1 << 16)
2016-09-16 17:44:20 +07:00
#define ATTR_TOUCH (1 << 17)
2005-04-17 05:20:36 +07:00

2014-10-24 05:14:36 +07:00
/*
 * Whiteout is represented by a char device. The following constants define the
 * mode and device number to use.
 */
#define WHITEOUT_MODE 0
#define WHITEOUT_DEV 0

2005-04-17 05:20:36 +07:00
/*
 * This is the Inode Attributes structure, used for notify_change(). It
 * uses the above definitions as flags, to know which values have changed.
 * Also, in this manner, a Filesystem can look at only the values it cares
 * about. Basically, these are the attributes that the VFS layer can
 * request to change from the FS layer.
 *
 * Derek Atkins <warlord@MIT.EDU> 94-10-20
 */
struct iattr {
	unsigned int ia_valid;
	umode_t ia_mode;
2012-02-08 22:07:50 +07:00
	kuid_t ia_uid;
	kgid_t ia_gid;
2005-04-17 05:20:36 +07:00
	loff_t ia_size;
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 09:36:02 +07:00
	struct timespec64 ia_atime;
	struct timespec64 ia_mtime;
	struct timespec64 ia_ctime;
2005-11-07 15:59:49 +07:00

	/*
2011-03-31 08:57:33 +07:00
	 * Not an attribute, but an auxiliary info for filesystems wanting to
2005-11-07 15:59:49 +07:00
	 * implement an ftruncate() like method. NOTE: filesystem should
	 * check for (ia_valid & ATTR_FILE), and not for (ia_file != NULL).
	 */
	struct file *ia_file;
2005-04-17 05:20:36 +07:00
};
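As a sketch of how a filesystem might consume these flags, the following userspace model tests individual bits of ia_valid and follows the NOTE above by checking ATTR_FILE rather than the ia_file pointer. struct iattr_model, the helper names, and the reduced field set are assumptions for illustration; the flag values are the ones defined above:

```c
#include <stddef.h>

#define ATTR_MODE	(1 << 0)
#define ATTR_SIZE	(1 << 3)
#define ATTR_FILE	(1 << 13)

/* Hypothetical cut-down model of struct iattr (most fields omitted). */
struct iattr_model {
	unsigned int ia_valid;
	unsigned int ia_mode;
	long long ia_size;
	void *ia_file;	/* meaningful only when ATTR_FILE is set */
};

/* Nonzero if the caller asked for a size change. */
static int wants_size_change(const struct iattr_model *attr)
{
	return (attr->ia_valid & ATTR_SIZE) != 0;
}

/* Nonzero if ia_file may be used; decided by the flag, not the pointer. */
static int has_open_file(const struct iattr_model *attr)
{
	return (attr->ia_valid & ATTR_FILE) != 0;
}
```

In this model a non-NULL ia_file without ATTR_FILE is treated as stale, so has_open_file() reports it as unusable.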

/*
 * Includes for diskquotas.
 */
#include <linux/quota.h>

2014-10-24 05:14:39 +07:00
/*
 * Maximum number of layers of fs stack. Needs to be limited to
 * prevent kernel stack overflow
 */
#define FILESYSTEM_MAX_STACK_DEPTH 2

2005-12-16 05:28:17 +07:00
/**
 * enum positive_aop_returns - aop return codes with specific semantics
 *
 * @AOP_WRITEPAGE_ACTIVATE: Informs the caller that page writeback has
 *                          completed, that the page is still locked, and
 *                          should be considered active. The VM uses this hint
 *                          to return the page to the active list -- it won't
 *                          be a candidate for writeback again in the near
 *                          future. Other callers must be careful to unlock
 *                          the page if they get this return. Returned by
 *                          writepage();
 *
 * @AOP_TRUNCATED_PAGE: The AOP method that was handed a locked page has
 *                      unlocked it and the page might have been truncated.
 *                      The caller should back up to acquiring a new page and
 *                      trying again. The aop will be taking reasonable
 *                      precautions not to livelock. If the caller held a page
 *                      reference, it should drop it before retrying. Returned
2007-10-16 15:25:26 +07:00
 *                      by readpage().
2005-12-16 05:28:17 +07:00
 *
 * address_space_operation functions return these large constants to indicate
 * special semantics to the caller. These are much larger than the bytes in a
 * page to allow for functions that return the number of bytes operated on in a
 * given page.
 */

enum positive_aop_returns {
	AOP_WRITEPAGE_ACTIVATE = 0x80000,
	AOP_TRUNCATED_PAGE = 0x80001,
};
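A caller-side sketch of why these constants are kept well above any per-page byte count: one integer return value can carry a negative error, an ordinary count, or one of the two special codes. The enum aop_outcome and the function classify_aop_return are hypothetical names introduced for illustration; the two AOP_* values are the ones defined above:

```c
enum positive_aop_returns {
	AOP_WRITEPAGE_ACTIVATE	= 0x80000,
	AOP_TRUNCATED_PAGE	= 0x80001,
};

/* Hypothetical classification of an aop return value. */
enum aop_outcome {
	AOP_OUT_ERROR,		/* negative errno */
	AOP_OUT_BYTES,		/* ordinary result, e.g. a byte count */
	AOP_OUT_ACTIVATE,	/* page still locked, keep it active */
	AOP_OUT_RETRY,		/* page was unlocked, get a new one and retry */
};

static enum aop_outcome classify_aop_return(int ret)
{
	if (ret < 0)
		return AOP_OUT_ERROR;
	if (ret == AOP_WRITEPAGE_ACTIVATE)
		return AOP_OUT_ACTIVATE;
	if (ret == AOP_TRUNCATED_PAGE)
		return AOP_OUT_RETRY;
	return AOP_OUT_BYTES;
}
```

The classification only stays unambiguous as long as no legitimate byte count can reach 0x80000, which is the constraint the comment above describes.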

2017-05-09 05:58:59 +07:00
#define AOP_FLAG_CONT_EXPAND 0x0001 /* called from cont_expand */
#define AOP_FLAG_NOFS 0x0002 /* used by filesystem to direct
fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in its write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 03:00:53 +07:00
 * helper code (eg buffer layer)
 * to clear GFP_FS from alloc */
2007-10-16 15:25:01 +07:00

2005-04-17 05:20:36 +07:00
/*
 * oh the beauties of C type declarations.
 */
struct page;
struct address_space;
struct writeback_control;
2020-06-02 11:46:44 +07:00
struct readahead_control;
2005-04-17 05:20:36 +07:00

fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:
  RWH_WRITE_LIFE_NOT_SET   No hint information set
  RWH_WRITE_LIFE_NONE      No hints about write life time
  RWH_WRITE_LIFE_SHORT     Data written has a short life time
  RWH_WRITE_LIFE_MEDIUM    Data written has a medium life time
  RWH_WRITE_LIFE_LONG      Data written has a long life time
  RWH_WRITE_LIFE_EXTREME   Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
  F_GET_RW_HINT        Returns the read/write hint set on the
                       underlying inode.
  F_SET_RW_HINT        Set one of the above write hints on the
                       underlying inode.
  F_GET_FILE_RW_HINT   Returns the read/write hint set on the
                       file descriptor.
  F_SET_FILE_RW_HINT   Set one of the above write hints on the
                       file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode have a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
 * writehint.c: get or set an inode write hint
 */
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>

#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif

/* Names indexed by hint value, in the order listed above */
static char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	/* Optional second argument: numeric hint to set on the inode */
	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	/* Read the hint back and print its name */
	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 00:47:04 +07:00
/*
 * Write life time hint values.
2018-07-05 13:25:43 +07:00
 * Stored in struct inode as u8.
2017-06-28 00:47:04 +07:00
 */
enum rw_hint {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE = RWH_WRITE_LIFE_NONE,
	WRITE_LIFE_SHORT = RWH_WRITE_LIFE_SHORT,
	WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
	WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
	WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
};

2015-02-22 23:58:50 +07:00
#define IOCB_EVENTFD (1 << 0)
2015-04-10 00:52:01 +07:00
#define IOCB_APPEND (1 << 1)
#define IOCB_DIRECT (1 << 2)
2016-03-03 22:04:01 +07:00
#define IOCB_HIPRI (1 << 3)
2016-04-07 22:52:00 +07:00
#define IOCB_DSYNC (1 << 4)
#define IOCB_SYNC (1 << 5)
2016-10-30 23:42:04 +07:00
#define IOCB_WRITE (1 << 6)
2017-06-20 19:05:43 +07:00
#define IOCB_NOWAIT (1 << 7)
2020-05-22 22:12:09 +07:00
/* iocb->ki_waitq is valid */
#define IOCB_WAITQ (1 << 8)
2019-11-22 06:25:07 +07:00
#define IOCB_NOIO (1 << 9)
2015-02-22 23:58:50 +07:00

struct kiocb {
	struct file *ki_filp;
aio: simplify - and fix - fget/fput for io_submit()
Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:
"In more details - normally IOCB_CMD_POLL handling looks so:
1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()
2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.
5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.
But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:
Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04 05:23:33 +07:00

	/* The 'ki_filp' pointer is shared in a union for aio */
	randomized_struct_fields_start

2015-02-22 23:58:50 +07:00
	loff_t ki_pos;
	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
	void *private;
	int ki_flags;
2018-05-23 00:52:18 +07:00
	u16 ki_hint;
2018-05-23 00:52:19 +07:00
	u16 ki_ioprio; /* See linux/ioprio.h */
2020-05-22 22:12:09 +07:00
	union {
		unsigned int ki_cookie; /* for ->iopoll */
		struct wait_page_queue *ki_waitq; /* for async buffered IO */
	};
2019-03-04 05:23:33 +07:00
|
|
|
|
|
|
|
randomized_struct_fields_end
|
|
|
|
};
|
2015-02-22 23:58:50 +07:00
|
|
|
|
|
|
|
static inline bool is_sync_kiocb(struct kiocb *kiocb)
|
|
|
|
{
|
|
|
|
return kiocb->ki_complete == NULL;
|
|
|
|
}
|
|
|
|
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
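The check described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel implementation: struct sketch_page, sketch_range_uptodate() and the 4096-byte page size are hypothetical stand-ins for a page and its per-block buffer state.

```c
#include <assert.h>   /* only needed if you add assert()-based checks */
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_MIN_BLOCK 512u   /* assumed smallest block size */

/* Hypothetical stand-in for a page plus its per-block "uptodate" bits. */
struct sketch_page {
	bool page_uptodate;
	unsigned int block_size;  /* must divide SKETCH_PAGE_SIZE */
	bool block_uptodate[SKETCH_PAGE_SIZE / SKETCH_MIN_BLOCK];
};

/*
 * Return true if bytes [from, from + count) of the page can be served
 * from the cache: either the whole page is uptodate, or every block
 * overlapping the range is uptodate.
 */
static bool sketch_range_uptodate(const struct sketch_page *pg,
				  unsigned int from, unsigned int count)
{
	unsigned int first, last, i;

	if (pg->page_uptodate)
		return true;
	if (count == 0 || from >= SKETCH_PAGE_SIZE ||
	    count > SKETCH_PAGE_SIZE - from)
		return false;

	first = from / pg->block_size;
	last = (from + count - 1) / pg->block_size;
	for (i = first; i <= last; i++)
		if (!pg->block_uptodate[i])
			return false;
	return true;
}
```

With 1024-byte blocks and only block 1 uptodate, a read of bytes 1024..2047 could be served from the cache, while a read of bytes 512..1535 would still need IO because it touches block 0.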
I wrote a benchmark program and measured the following results with it.
The benchmark does the following:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26: 330 sec
2.6.26-patched: 226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP (1024*512) /* 512MB */
int main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644); /* O_CREAT requires a mode argument */
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", (long)(t2 - t1));
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 05:46:36 +07:00
|
|
|
/*
|
|
|
|
* "descriptor" for what we're up to with a read.
|
|
|
|
* This allows us to use the same read code yet
|
|
|
|
* have multiple different users of the data that
|
|
|
|
* we read from a file.
|
|
|
|
*
|
|
|
|
* The simplest case just copies the data to user
|
|
|
|
* mode.
|
|
|
|
*/
|
|
|
|
typedef struct {
|
|
|
|
size_t written;
|
|
|
|
size_t count;
|
|
|
|
union {
|
|
|
|
char __user *buf;
|
|
|
|
void *data;
|
|
|
|
} arg;
|
|
|
|
int error;
|
|
|
|
} read_descriptor_t;
|
|
|
|
|
|
|
|
typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
|
|
|
|
unsigned long, unsigned long);
|
2007-10-16 15:24:59 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct address_space_operations {
|
|
|
|
int (*writepage)(struct page *page, struct writeback_control *wbc);
|
|
|
|
int (*readpage)(struct file *, struct page *);
|
|
|
|
|
|
|
|
/* Write back some dirty pages from this mapping. */
|
|
|
|
int (*writepages)(struct address_space *, struct writeback_control *);
|
|
|
|
|
2006-03-24 18:18:11 +07:00
|
|
|
/* Set a page dirty. Return true if this dirtied it */
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*set_page_dirty)(struct page *page);
|
|
|
|
|
2018-08-18 05:45:36 +07:00
|
|
|
/*
|
|
|
|
* Reads in the requested pages. Unlike ->readpage(), this is
|
|
|
|
* PURELY used for read-ahead!.
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*readpages)(struct file *filp, struct address_space *mapping,
|
|
|
|
struct list_head *pages, unsigned nr_pages);
|
2020-06-02 11:46:44 +07:00
|
|
|
void (*readahead)(struct readahead_control *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-10-16 15:25:01 +07:00
|
|
|
int (*write_begin)(struct file *, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned flags,
|
|
|
|
struct page **pagep, void **fsdata);
|
|
|
|
int (*write_end)(struct file *, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned copied,
|
|
|
|
struct page *page, void *fsdata);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
|
|
|
|
sector_t (*bmap)(struct address_space *, sector_t);
|
2013-05-22 10:17:23 +07:00
|
|
|
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
|
2005-10-21 14:20:48 +07:00
|
|
|
int (*releasepage) (struct page *, gfp_t);
|
2010-12-02 01:35:19 +07:00
|
|
|
void (*freepage)(struct page *);
|
2016-04-07 22:51:58 +07:00
|
|
|
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
|
2012-01-13 08:19:34 +07:00
|
|
|
/*
|
2013-07-09 06:00:21 +07:00
|
|
|
* migrate the contents of a page to the specified target. If
|
|
|
|
* migrate_mode is MIGRATE_ASYNC, it must not block.
|
2012-01-13 08:19:34 +07:00
|
|
|
*/
|
2006-06-23 16:03:33 +07:00
|
|
|
int (*migratepage) (struct address_space *,
|
2012-01-13 08:19:43 +07:00
|
|
|
struct page *, struct page *, enum migrate_mode);
|
mm: migrate: support non-lru movable page migration
Until now we have allowed migration only for LRU pages, and that was enough
to make high-order pages. Recently, however, embedded systems (e.g., webOS,
Android) use lots of non-movable pages (e.g., zram, GPU memory), and we
have seen several reports of trouble with small high-order allocations.
There have been several efforts to fix the problem (e.g., enhancing the
compaction algorithm, SLUB fallback to 0-order pages, reserved memory,
vmalloc and so on), but if there are lots of non-movable pages in the system,
those solutions are void in the long run.
So this patch adds a facility to turn non-movable pages into movable
ones. For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.
If a driver wants to make its own pages movable, it should define three
functions, which are function pointers of struct
address_space_operations.
1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
What the VM expects of the driver's isolate_page function is that it returns
*true* if the driver isolates the page successfully. On returning true, the VM
marks the page as PG_isolated so that concurrent isolation on several CPUs
skips the page. If a driver cannot isolate the page, it should
return *false*.
Once the page is successfully isolated, the VM uses the page.lru fields, so the
driver shouldn't expect the values in those fields to be preserved.
2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);
After isolation, the VM calls the driver's migratepage with the isolated page.
The job of migratepage is to move the content of the old page to the new page
and set up the fields of struct page newpage. Keep in mind that you should
indicate to the VM that the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and return 0. If the driver cannot migrate the page at the
moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
migration in a short time because the VM interprets -EAGAIN as "temporary
migration failure". On any error other than -EAGAIN, the VM will give
up the page migration without retrying this time.
The driver shouldn't touch the page.lru field, which the VM is using, in these functions.
3. void (*putback_page)(struct page *);
If migration fails on an isolated page, the VM should return the isolated page
to the driver, so the VM calls the driver's putback_page with the page that
failed to migrate. In this function, the driver should put the isolated page
back into its own data structure.
4. non-lru movable page flags
There are two page flags for supporting non-lru movable page.
* PG_movable
Driver should use the below function to make page movable under
page_lock.
void __SetPageMovable(struct page *page, struct address_space *mapping)
It needs an address_space argument for registering the migration family of
functions that will be called by the VM. Strictly speaking, PG_movable is
not a real flag of struct page. Rather, the VM reuses the lower bits of
page->mapping to represent it.
#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
so the driver shouldn't access page->mapping directly. Instead, the driver
should use page_mapping, which masks off the low two bits of page->mapping
so it can get the right struct address_space.
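The low-bit tagging described above can be sketched in user space like this. It is a minimal illustration, not the kernel code: every sketch_* name is hypothetical, the 0x1 value for the anonymous-mapping bit is an assumption, and the alignment is forced so that the low two bits of the pointer are known to be free.

```c
#include <assert.h>   /* only needed if you add assert()-based checks */
#include <stdint.h>

#define SKETCH_MAPPING_ANON    0x1u  /* assumed value of the other low bit */
#define SKETCH_MAPPING_MOVABLE 0x2u  /* mirrors PAGE_MAPPING_MOVABLE above */
#define SKETCH_MAPPING_FLAGS   (SKETCH_MAPPING_ANON | SKETCH_MAPPING_MOVABLE)

/* Hypothetical stand-in for struct address_space; aligned so that the
 * low two bits of any pointer to it are zero. */
struct sketch_mapping {
	_Alignas(long) int id;
};

/* Store the "movable" tag in the low bits of the pointer value. */
static uintptr_t sketch_set_movable(struct sketch_mapping *m)
{
	return (uintptr_t)m | SKETCH_MAPPING_MOVABLE;
}

/* Cheap test: only the tag bits are inspected. */
static int sketch_is_movable(uintptr_t v)
{
	return (v & SKETCH_MAPPING_FLAGS) == SKETCH_MAPPING_MOVABLE;
}

/* Mask off the low two bits to recover the real pointer. */
static struct sketch_mapping *sketch_mapping_ptr(uintptr_t v)
{
	return (struct sketch_mapping *)(v & ~(uintptr_t)SKETCH_MAPPING_FLAGS);
}
```

This is why reading the raw field directly is unsafe in the sketch as well: the tagged value is not a valid pointer until the flag bits are masked off.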
To test for a non-lru movable page, the VM supports the __PageMovable function.
However, it isn't guaranteed to identify a non-lru movable page because
the page->mapping field is unified with other variables in struct page. Also,
if the driver releases the page after isolation by the VM, page->mapping
doesn't have a stable value although it has PAGE_MAPPING_MOVABLE (look at
__ClearPageMovable). But __PageMovable is a cheap way to tell whether a page
is LRU or non-lru movable once the page has been isolated, because LRU
pages can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test for non-lru movable pages before the more
expensive check with lock_page during pfn scanning to select a victim.
To guarantee a non-lru movable page, the VM provides the PageMovable function.
Unlike __PageMovable, the PageMovable function validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
page->mapping from being destroyed suddenly.
A driver using __SetPageMovable should clear the flag via
__ClearPageMovable under page_lock before releasing the page.
* PG_isolated
To prevent concurrent isolation among several CPUs, the VM marks an isolated
page as PG_isolated under lock_page. So if a CPU encounters a PG_isolated
non-lru movable page, it can skip it. The driver doesn't need to manipulate
the flag because the VM will set/clear it automatically. Keep in mind that
if the driver sees a PG_isolated page, it means the page has been isolated by
the VM, so it shouldn't touch the page.lru field. PG_isolated is an alias of
the PG_reclaim flag, so the driver shouldn't use that flag for its own purposes.
[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 05:23:05 +07:00
|
|
|
bool (*isolate_page)(struct page *, isolate_mode_t);
|
|
|
|
void (*putback_page)(struct page *);
|
2007-01-11 14:15:39 +07:00
|
|
|
int (*launder_page) (struct page *);
|
2014-02-03 09:16:54 +07:00
|
|
|
int (*is_partially_uptodate) (struct page *, unsigned long,
|
2008-07-29 05:46:36 +07:00
|
|
|
unsigned long);
|
2013-07-04 05:02:05 +07:00
|
|
|
void (*is_dirty_writeback) (struct page *, bool *, bool *);
|
2009-09-16 16:50:13 +07:00
|
|
|
int (*error_remove_page)(struct address_space *, struct page *);
|
2012-08-01 06:44:55 +07:00
|
|
|
|
|
|
|
/* swapfile support */
|
2012-08-01 06:44:57 +07:00
|
|
|
int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
|
|
|
|
sector_t *span);
|
|
|
|
void (*swap_deactivate)(struct file *file);
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2011-04-06 04:51:48 +07:00
|
|
|
extern const struct address_space_operations empty_aops;
|
|
|
|
|
2007-10-16 15:25:01 +07:00
|
|
|
/*
|
|
|
|
* pagecache_write_begin/pagecache_write_end must be used by general code
|
|
|
|
* to write into the pagecache.
|
|
|
|
*/
|
|
|
|
int pagecache_write_begin(struct file *, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned flags,
|
|
|
|
struct page **pagep, void **fsdata);
|
|
|
|
|
|
|
|
int pagecache_write_end(struct file *, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned copied,
|
|
|
|
struct page *page, void *fsdata);
|
|
|
|
|
2018-03-06 10:46:03 +07:00
|
|
|
/**
|
|
|
|
* struct address_space - Contents of a cacheable, mappable object.
|
|
|
|
* @host: Owner, either the inode or the block_device.
|
|
|
|
* @i_pages: Cached pages.
|
|
|
|
* @gfp_mask: Memory allocation flags to use for allocating pages.
|
|
|
|
* @i_mmap_writable: Number of VM_SHARED mappings.
|
2019-09-24 05:38:03 +07:00
|
|
|
* @nr_thps: Number of THPs in the pagecache (non-shmem only).
|
2018-03-06 10:46:03 +07:00
|
|
|
* @i_mmap: Tree of private and shared mappings.
|
|
|
|
* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
|
|
|
|
* @nrpages: Number of page entries, protected by the i_pages lock.
|
|
|
|
* @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
|
|
|
|
* @writeback_index: Writeback starts here.
|
|
|
|
* @a_ops: Methods.
|
|
|
|
* @flags: Error bits and flags (AS_*).
|
|
|
|
* @wb_err: The most recent error which has occurred.
|
|
|
|
* @private_lock: For use by the owner of the address_space.
|
|
|
|
* @private_list: For use by the owner of the address_space.
|
|
|
|
* @private_data: For use by the owner of the address_space.
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
struct address_space {
|
2018-03-06 10:46:03 +07:00
|
|
|
struct inode *host;
|
|
|
|
struct xarray i_pages;
|
|
|
|
gfp_t gfp_mask;
|
|
|
|
atomic_t i_mmap_writable;
|
2019-09-24 05:38:03 +07:00
|
|
|
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
|
|
|
|
/* number of thp, only for non-shmem files */
|
|
|
|
atomic_t nr_thps;
|
|
|
|
#endif
|
2018-03-06 10:46:03 +07:00
|
|
|
struct rb_root_cached i_mmap;
|
|
|
|
struct rw_semaphore i_mmap_rwsem;
|
|
|
|
unsigned long nrpages;
|
2016-01-23 06:10:40 +07:00
|
|
|
unsigned long nrexceptional;
|
2018-03-06 10:46:03 +07:00
|
|
|
pgoff_t writeback_index;
|
|
|
|
const struct address_space_operations *a_ops;
|
|
|
|
unsigned long flags;
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
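The "since" cursor idea described above can be sketched as follows. This is a deliberately simplified illustration and not the real errseq_t, which packs its state into a single value; here the mapping keeps a plain sequence counter plus the latest error, and all sketch_* names are hypothetical.

```c
#include <assert.h>   /* only needed if you add assert()-based checks */
#include <errno.h>

/* Per-mapping state: bumped every time a writeback error is recorded. */
struct sketch_mapping_err {
	unsigned int seq;
	int err;          /* most recent error, as a negative errno */
};

/* Per-open-file cursor: the sequence value this file has already seen. */
struct sketch_file {
	unsigned int since;
};

/* Record a writeback error on the mapping. */
static void sketch_set_error(struct sketch_mapping_err *m, int err)
{
	m->err = err;
	m->seq++;
}

/* At open time, sample the current sequence as the file's cursor. */
static void sketch_open(const struct sketch_mapping_err *m,
			struct sketch_file *f)
{
	f->since = m->seq;
}

/*
 * At fsync time: report the error if one was recorded since this file
 * last checked, then advance the cursor so it is reported only once
 * per open file.
 */
static int sketch_check_and_advance(const struct sketch_mapping_err *m,
				    struct sketch_file *f)
{
	if (f->since == m->seq)
		return 0;
	f->since = m->seq;
	return m->err;
}
```

In this sketch, two files opened before an -EIO is recorded each see -EIO on their next check and 0 on the one after, regardless of which of them checks first. A file opened after the error starts with an up-to-date cursor and sees 0.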
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 18:02:25 +07:00
|
|
|
errseq_t wb_err;
|
2018-03-06 10:46:03 +07:00
|
|
|
spinlock_t private_lock;
|
|
|
|
struct list_head private_list;
|
|
|
|
void *private_data;
|
2016-10-28 15:22:25 +07:00
|
|
|
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* On most architectures that alignment is already the case; but
|
2011-03-31 08:57:33 +07:00
|
|
|
* must be enforced here for CRIS, to let the least significant bit
|
2005-04-17 05:20:36 +07:00
|
|
|
* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
|
|
|
|
*/
|
|
|
|
|
2017-11-22 23:41:23 +07:00
|
|
|
/* XArray tags, for tagging dirty and writeback pages in the pagecache. */
|
|
|
|
#define PAGECACHE_TAG_DIRTY XA_MARK_0
|
|
|
|
#define PAGECACHE_TAG_WRITEBACK XA_MARK_1
|
|
|
|
#define PAGECACHE_TAG_TOWRITE XA_MARK_2
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2017-11-22 23:41:23 +07:00
|
|
|
* Returns true if any of the pages in the mapping are marked with the tag.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2017-11-22 23:41:23 +07:00
|
|
|
static inline bool mapping_tagged(struct address_space *mapping, xa_mark_t tag)
|
|
|
|
{
|
|
|
|
return xa_marked(&mapping->i_pages, tag);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
mm,fs: introduce helpers around the i_mmap_mutex
This series is a continuation of the conversion of the i_mmap_mutex to
rwsem, following what we have for the anon memory counterpart. With
Hugh's feedback from the first iteration.
Ultimately, the most obvious paths that require exclusive ownership of the
lock are those that modify the VMA interval tree, via the
vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases
such as unmapping, where the ptes content is changed but the tree remains
untouched should make it safe to share the i_mmap_rwsem.
As such, the code of course is straightforward, however the devil is very
much in the details. While it's been tested on a number of workloads
without anything exploding, I would not be surprised if there are some
less documented/known assumptions about the lock that could suffer from
these changes. Or maybe I'm just missing something, but either way I
believe it's at the point where it could use more eyes and hopefully some
time in linux-next.
Because the lock type conversion is the heart of this patchset,
it's worth noting a few comparisons between mutex vs rwsem (xadd):
(i) Same size, no extra footprint.
(ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
exclusive lock ownership.
(iii) Both can be slightly unfair wrt exclusive ownership, with
writer lock stealing properties, not necessarily respecting
FIFO order for granting the lock when contended.
(iv) Mutexes can be slightly faster than rwsems when
the lock is non-contended.
(v) Both suck at performance for debug (slowpaths), which
shouldn't matter anyway.
Sharing the lock is obviously beneficial, and sem writer ownership is
close enough to mutexes. The biggest winner of these changes is
migration.
As for concrete numbers, the following performance results are for a
4-socket 60-core IvyBridge-EX with 130Gb of RAM.
Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
with this set, with a steady ~60% throughput (jpm) increase for alltests
and up to ~30% for disk for high amounts of concurrency. Lower counts of
workload users (< 100) do not show much difference at all, so at least
no regressions.
3.18-rc1 3.18-rc1-i_mmap_rwsem
alltests-100 17918.72 ( 0.00%) 28417.97 ( 58.59%)
alltests-200 16529.39 ( 0.00%) 26807.92 ( 62.18%)
alltests-300 16591.17 ( 0.00%) 26878.08 ( 62.00%)
alltests-400 16490.37 ( 0.00%) 26664.63 ( 61.70%)
alltests-500 16593.17 ( 0.00%) 26433.72 ( 59.30%)
alltests-600 16508.56 ( 0.00%) 26409.20 ( 59.97%)
alltests-700 16508.19 ( 0.00%) 26298.58 ( 59.31%)
alltests-800 16437.58 ( 0.00%) 26433.02 ( 60.81%)
alltests-900 16418.35 ( 0.00%) 26241.61 ( 59.83%)
alltests-1000 16369.00 ( 0.00%) 26195.76 ( 60.03%)
alltests-1100 16330.11 ( 0.00%) 26133.46 ( 60.03%)
alltests-1200 16341.30 ( 0.00%) 26084.03 ( 59.62%)
alltests-1300 16304.75 ( 0.00%) 26024.74 ( 59.61%)
alltests-1400 16231.08 ( 0.00%) 25952.35 ( 59.89%)
alltests-1500 16168.06 ( 0.00%) 25850.58 ( 59.89%)
alltests-1600 16142.56 ( 0.00%) 25767.42 ( 59.62%)
alltests-1700 16118.91 ( 0.00%) 25689.58 ( 59.38%)
alltests-1800 16068.06 ( 0.00%) 25599.71 ( 59.32%)
alltests-1900 16046.94 ( 0.00%) 25525.92 ( 59.07%)
alltests-2000 16007.26 ( 0.00%) 25513.07 ( 59.38%)
disk-100 7582.14 ( 0.00%) 7257.48 ( -4.28%)
disk-200 6962.44 ( 0.00%) 7109.15 ( 2.11%)
disk-300 6435.93 ( 0.00%) 6904.75 ( 7.28%)
disk-400 6370.84 ( 0.00%) 6861.26 ( 7.70%)
disk-500 6353.42 ( 0.00%) 6846.71 ( 7.76%)
disk-600 6368.82 ( 0.00%) 6806.75 ( 6.88%)
disk-700 6331.37 ( 0.00%) 6796.01 ( 7.34%)
disk-800 6324.22 ( 0.00%) 6788.00 ( 7.33%)
disk-900 6253.52 ( 0.00%) 6750.43 ( 7.95%)
disk-1000 6242.53 ( 0.00%) 6855.11 ( 9.81%)
disk-1100 6234.75 ( 0.00%) 6858.47 ( 10.00%)
disk-1200 6312.76 ( 0.00%) 6845.13 ( 8.43%)
disk-1300 6309.95 ( 0.00%) 6834.51 ( 8.31%)
disk-1400 6171.76 ( 0.00%) 6787.09 ( 9.97%)
disk-1500 6139.81 ( 0.00%) 6761.09 ( 10.12%)
disk-1600 4807.12 ( 0.00%) 6725.33 ( 39.90%)
disk-1700 4669.50 ( 0.00%) 5985.38 ( 28.18%)
disk-1800 4663.51 ( 0.00%) 5972.99 ( 28.08%)
disk-1900 4674.31 ( 0.00%) 5949.94 ( 27.29%)
disk-2000 4668.36 ( 0.00%) 5834.93 ( 24.99%)
In addition, a 67.5% increase in successfully migrated NUMA pages, thus
improving node locality.
The patch layout is simple but designed for bisection (in case reversion
is needed if the changes break upstream) and easier review:
o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths, each patch
details the rationale behind why it should be safe.
This patchset has been tested with: postgres 9.4 (with brand new hugetlb
support), hugetlbfs test suite (all tests pass, in fact more tests pass
with these changes than with an upstream kernel), ltp, aim7 benchmarks,
memcached and iozone with the -B option for mmap'ing. *Untested* paths
are nommu, memory-failure, uprobes and xip.
This patch (of 8):
Various parts of the kernel acquire and release this mutex, so add
i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
encapsulate this logic. The next patch will make use of these.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 07:54:18 +07:00
|
|
|
static inline void i_mmap_lock_write(struct address_space *mapping)
|
|
|
|
{
|
2014-12-13 07:54:24 +07:00
|
|
|
down_write(&mapping->i_mmap_rwsem);
|
mm,fs: introduce helpers around the i_mmap_mutex
This series is a continuation of the conversion of the i_mmap_mutex to
rwsem, following what we have for the anon memory counterpart. With
Hugh's feedback from the first iteration.
Ultimately, the most obvious paths that require exclusive ownership of the
lock are those that modify the VMA interval tree, via the
vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases
such as unmapping, where the ptes content is changed but the tree remains
untouched should make it safe to share the i_mmap_rwsem.
As such, the code of course is straightforward, however the devil is very
much in the details. While it's been tested on a number of workloads
without anything exploding, I would not be surprised if there are some
less documented/known assumptions about the lock that could suffer from
these changes. Or maybe I'm just missing something, but either way I
believe it's at the point where it could use more eyes and hopefully some
time in linux-next.
Because the lock type conversion is the heart of this patchset,
it's worth noting a few comparisons between mutex vs rwsem (xadd):
(i) Same size, no extra footprint.
(ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
exclusive lock ownership.
(iii) Both can be slightly unfair wrt exclusive ownership, with
writer lock stealing properties, not necessarily respecting
FIFO order for granting the lock when contended.
(iv) Mutexes can be slightly faster than rwsems when
the lock is non-contended.
(v) Both suck at performance for debug (slowpaths), which
shouldn't matter anyway.
Sharing the lock is obviously beneficial, and sem writer ownership is
close enough to mutexes. The biggest winner of these changes is
migration.
As for concrete numbers, the following performance results are for a
4-socket 60-core IvyBridge-EX with 130Gb of RAM.
Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
with this set, with a steady ~60% throughput (jpm) increase for alltests
and up to ~30% for disk for high amounts of concurrency. Lower counts of
workload users (< 100) do not show much difference at all, so at least
no regressions.
               3.18-rc1            3.18-rc1-i_mmap_rwsem
alltests-100   17918.72 (  0.00%)  28417.97 ( 58.59%)
alltests-200   16529.39 (  0.00%)  26807.92 ( 62.18%)
alltests-300   16591.17 (  0.00%)  26878.08 ( 62.00%)
alltests-400   16490.37 (  0.00%)  26664.63 ( 61.70%)
alltests-500   16593.17 (  0.00%)  26433.72 ( 59.30%)
alltests-600   16508.56 (  0.00%)  26409.20 ( 59.97%)
alltests-700   16508.19 (  0.00%)  26298.58 ( 59.31%)
alltests-800   16437.58 (  0.00%)  26433.02 ( 60.81%)
alltests-900   16418.35 (  0.00%)  26241.61 ( 59.83%)
alltests-1000  16369.00 (  0.00%)  26195.76 ( 60.03%)
alltests-1100  16330.11 (  0.00%)  26133.46 ( 60.03%)
alltests-1200  16341.30 (  0.00%)  26084.03 ( 59.62%)
alltests-1300  16304.75 (  0.00%)  26024.74 ( 59.61%)
alltests-1400  16231.08 (  0.00%)  25952.35 ( 59.89%)
alltests-1500  16168.06 (  0.00%)  25850.58 ( 59.89%)
alltests-1600  16142.56 (  0.00%)  25767.42 ( 59.62%)
alltests-1700  16118.91 (  0.00%)  25689.58 ( 59.38%)
alltests-1800  16068.06 (  0.00%)  25599.71 ( 59.32%)
alltests-1900  16046.94 (  0.00%)  25525.92 ( 59.07%)
alltests-2000  16007.26 (  0.00%)  25513.07 ( 59.38%)
disk-100        7582.14 (  0.00%)   7257.48 ( -4.28%)
disk-200        6962.44 (  0.00%)   7109.15 (  2.11%)
disk-300        6435.93 (  0.00%)   6904.75 (  7.28%)
disk-400        6370.84 (  0.00%)   6861.26 (  7.70%)
disk-500        6353.42 (  0.00%)   6846.71 (  7.76%)
disk-600        6368.82 (  0.00%)   6806.75 (  6.88%)
disk-700        6331.37 (  0.00%)   6796.01 (  7.34%)
disk-800        6324.22 (  0.00%)   6788.00 (  7.33%)
disk-900        6253.52 (  0.00%)   6750.43 (  7.95%)
disk-1000       6242.53 (  0.00%)   6855.11 (  9.81%)
disk-1100       6234.75 (  0.00%)   6858.47 ( 10.00%)
disk-1200       6312.76 (  0.00%)   6845.13 (  8.43%)
disk-1300       6309.95 (  0.00%)   6834.51 (  8.31%)
disk-1400       6171.76 (  0.00%)   6787.09 (  9.97%)
disk-1500       6139.81 (  0.00%)   6761.09 ( 10.12%)
disk-1600       4807.12 (  0.00%)   6725.33 ( 39.90%)
disk-1700       4669.50 (  0.00%)   5985.38 ( 28.18%)
disk-1800       4663.51 (  0.00%)   5972.99 ( 28.08%)
disk-1900       4674.31 (  0.00%)   5949.94 ( 27.29%)
disk-2000       4668.36 (  0.00%)   5834.93 ( 24.99%)
In addition, there is a 67.5% increase in successfully migrated NUMA pages,
thus improving node locality.
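The percentages in parentheses in the aim7 table are consistent with a plain relative change against the 3.18-rc1 baseline. As a quick sanity check, here is a minimal userspace sketch; the helper name pct_change() is made up for illustration and is not part of the patchset:

```c
#include <assert.h>

/*
 * Relative change of `patched` against `baseline`, in percent.
 * Hypothetical helper, only meant to re-derive the table's
 * parenthesised column from the two throughput (jpm) values.
 */
static inline double pct_change(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For example, pct_change(17918.72, 28417.97) evaluates to roughly 58.59, matching the alltests-100 row, and pct_change(7582.14, 7257.48) to roughly -4.28, matching disk-100.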
The patch layout is simple but designed for bisection (in case reversion
is needed if the changes break upstream) and easier review:
o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths, each patch
details the rationale behind why it should be safe.
This patchset has been tested with: postgres 9.4 (with brand new hugetlb
support), hugetlbfs test suite (all tests pass, in fact more tests pass
with these changes than with an upstream kernel), ltp, aim7 benchmarks,
memcached and iozone with the -B option for mmap'ing. *Untested* paths
are nommu, memory-failure, uprobes and xip.
This patch (of 8):
Various parts of the kernel acquire and release this mutex, so add
i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
encapsulate this logic. The next patch will make use of these.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 07:54:18 +07:00
}

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
  mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
      page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 11:11:05 +07:00
static inline int i_mmap_trylock_write(struct address_space *mapping)
{
	return down_write_trylock(&mapping->i_mmap_rwsem);
}

2014-12-13 07:54:18 +07:00
static inline void i_mmap_unlock_write(struct address_space *mapping)
{
2014-12-13 07:54:24 +07:00
	up_write(&mapping->i_mmap_rwsem);
2014-12-13 07:54:18 +07:00
}

2014-12-13 07:54:27 +07:00
static inline void i_mmap_lock_read(struct address_space *mapping)
{
	down_read(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_read(struct address_space *mapping)
{
	up_read(&mapping->i_mmap_rwsem);
}

2005-04-17 05:20:36 +07:00
/*
 * Might pages of this file be mapped into userspace?
 */
static inline int mapping_mapped(struct address_space *mapping)
{
2017-09-09 06:15:08 +07:00
	return !RB_EMPTY_ROOT(&mapping->i_mmap.rb_root);
2005-04-17 05:20:36 +07:00
}

/*
 * Might pages of this file have been modified in userspace?
2020-08-07 13:23:37 +07:00
 * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap
2005-04-17 05:20:36 +07:00
 * marks vma as VM_SHARED if it is shared, and the file was opened for
 * writing i.e. vma may be mprotected writable even if now readonly.
2014-08-09 04:25:25 +07:00
 *
 * If i_mmap_writable is negative, no new writable mappings are allowed. You
 * can only deny writable mappings, if none exists right now.
2005-04-17 05:20:36 +07:00
 */
static inline int mapping_writably_mapped(struct address_space *mapping)
{
2014-08-09 04:25:25 +07:00
	return atomic_read(&mapping->i_mmap_writable) > 0;
}

static inline int mapping_map_writable(struct address_space *mapping)
{
	return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
		0 : -EPERM;
}

static inline void mapping_unmap_writable(struct address_space *mapping)
{
	atomic_dec(&mapping->i_mmap_writable);
}

static inline int mapping_deny_writable(struct address_space *mapping)
{
	return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
		0 : -EBUSY;
}

static inline void mapping_allow_writable(struct address_space *mapping)
{
	atomic_inc(&mapping->i_mmap_writable);
2005-04-17 05:20:36 +07:00
}
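
The four writable-mapping helpers above implement a small protocol on one signed counter: positive values count writable mappings, negative values count active denials, and the two states exclude each other. The following is only a userspace sketch of that protocol using C11 atomics. The toy_* names are hypothetical, and the compare-and-swap loops are an assumption about how atomic_inc_unless_negative() and atomic_dec_unless_positive() behave, not the kernel's implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the i_mmap_writable field of struct address_space. */
struct toy_mapping {
	atomic_int writable;	/* > 0: writable mappings, < 0: writes denied */
};

/* Increment unless the value is negative; returns true if it was incremented. */
static inline bool toy_inc_unless_negative(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		if (old < 0)
			return false;
		/* On CAS failure `old` is refreshed and re-checked. */
	} while (!atomic_compare_exchange_weak(v, &old, old + 1));
	return true;
}

/* Decrement unless the value is positive; returns true if it was decremented. */
static inline bool toy_dec_unless_positive(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		if (old > 0)
			return false;
	} while (!atomic_compare_exchange_weak(v, &old, old - 1));
	return true;
}

static inline int toy_map_writable(struct toy_mapping *m)
{
	return toy_inc_unless_negative(&m->writable) ? 0 : -EPERM;
}

static inline void toy_unmap_writable(struct toy_mapping *m)
{
	int old = atomic_fetch_sub(&m->writable, 1);

	assert(old > 0);	/* must pair with a successful toy_map_writable() */
}

static inline int toy_deny_writable(struct toy_mapping *m)
{
	return toy_dec_unless_positive(&m->writable) ? 0 : -EBUSY;
}

static inline void toy_allow_writable(struct toy_mapping *m)
{
	int old = atomic_fetch_add(&m->writable, 1);

	assert(old < 0);	/* must pair with a successful toy_deny_writable() */
}
```

In this sketch a denial fails with -EBUSY while any writable mapping exists, and a new writable mapping fails with -EPERM while any denial is active, which is the behaviour the comment on i_mmap_writable describes.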

/*
 * Use sequence counter to get consistent i_size on 32-bit processors.
 */
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
#include <linux/seqlock.h>
#define __NEED_I_SIZE_ORDERED
#define i_size_ordered_init(inode) seqcount_init(&inode->i_size_seqcount)
#else
#define i_size_ordered_init(inode) do { } while (0)
#endif
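
The reason for the sequence counter is that a 32-bit CPU cannot load a 64-bit i_size in one step, so a reader racing with a writer could combine halves of two different values. A minimal userspace sketch of the general sequence-counter idea follows. All toy_* names are hypothetical, it assumes a single writer at a time, and it uses default (sequentially consistent) C11 atomics for simplicity instead of the kernel's seqcount primitives:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical inode-like object holding a 64-bit size as two 32-bit halves. */
struct toy_inode {
	atomic_uint seq;		/* even: stable, odd: write in progress */
	_Atomic uint32_t size_lo;
	_Atomic uint32_t size_hi;
};

static inline void toy_inode_init(struct toy_inode *inode)
{
	atomic_init(&inode->seq, 0);
	atomic_init(&inode->size_lo, 0);
	atomic_init(&inode->size_hi, 0);
}

/* Writer side: callers are assumed to serialize writers externally. */
static inline void toy_size_write(struct toy_inode *inode, uint64_t size)
{
	unsigned int old = atomic_fetch_add(&inode->seq, 1);	/* now odd */

	assert((old & 1) == 0);		/* no other write may be in flight */
	atomic_store(&inode->size_lo, (uint32_t)size);
	atomic_store(&inode->size_hi, (uint32_t)(size >> 32));
	atomic_fetch_add(&inode->seq, 1);			/* even again */
}

/* Reader side: retry until both halves were read under one even sequence. */
static inline uint64_t toy_size_read(struct toy_inode *inode)
{
	unsigned int start;
	uint32_t lo, hi;

	do {
		start = atomic_load(&inode->seq);
		lo = atomic_load(&inode->size_lo);
		hi = atomic_load(&inode->size_hi);
	} while ((start & 1) || atomic_load(&inode->seq) != start);

	return ((uint64_t)hi << 32) | lo;
}
```

Readers never block the writer in this scheme; they only retry, which is why it suits a field that is read far more often than it is written.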

2009-06-09 06:50:45 +07:00
struct posix_acl;
#define ACL_NOT_CACHED ((void *)(-1))
2016-09-01 16:11:59 +07:00
#define ACL_DONT_CACHE ((void *)(-3))
2009-06-09 06:50:45 +07:00

2016-03-24 20:38:37 +07:00
static inline struct posix_acl *
uncached_acl_sentinel(struct task_struct *task)
{
	return (void *)task + 1;
}

static inline bool
is_uncached_acl(struct posix_acl *acl)
{
	return (long)acl & 1;
}
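
Both helpers above rely on the low bit of a pointer: real objects are aligned, so their addresses have bit 0 clear, while the sentinels (a task address plus one, or the odd constants -1 and -3) have it set. The following userspace sketch shows the same low-bit tagging idea. The toy_* names are hypothetical, and it keeps the sentinel as a void * rather than a typed pointer to stay within portable C:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct task_struct and struct posix_acl. */
struct toy_task { long id; };
struct toy_acl { int entries; };

#define TOY_ACL_NOT_CACHED ((void *)(-1))
#define TOY_ACL_DONT_CACHE ((void *)(-3))

/* Build a per-task sentinel by setting the low bit of the task address. */
static inline void *toy_uncached_sentinel(struct toy_task *task)
{
	/* An aligned object has a clear low bit, so adding one sets it. */
	assert(((uintptr_t)task & 1) == 0);
	return (void *)((uintptr_t)task + 1);
}

/* True for any sentinel, false for a pointer to a real (aligned) ACL. */
static inline bool toy_is_uncached(const void *acl)
{
	return ((uintptr_t)acl & 1) != 0;
}

/* Recover the task a sentinel was derived from by clearing the low bit. */
static inline struct toy_task *toy_sentinel_owner(void *sentinel)
{
	assert(toy_is_uncached(sentinel));
	return (struct toy_task *)((uintptr_t)sentinel - 1);
}
```

A single bit test therefore separates "no cached ACL here" markers from usable ACL pointers without any extra storage.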

2011-08-07 12:45:50 +07:00
#define IOP_FASTPERM 0x0001
#define IOP_LOOKUP 0x0002
#define IOP_NOFOLLOW 0x0004
2016-09-29 22:48:39 +07:00
#define IOP_XATTR 0x0008
2016-12-09 22:45:04 +07:00
#define IOP_DEFAULT_READLINK 0x0010
2011-08-07 12:45:50 +07:00

2017-03-14 18:31:02 +07:00
struct fsnotify_mark_connector;

2011-08-07 12:45:50 +07:00
/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
2005-04-17 05:20:36 +07:00
struct inode {
2011-01-07 13:49:56 +07:00
	umode_t i_mode;
2011-08-07 12:45:50 +07:00
	unsigned short i_opflags;
2012-02-08 22:07:50 +07:00
	kuid_t i_uid;
	kgid_t i_gid;
2011-08-07 12:45:50 +07:00
	unsigned int i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl *i_acl;
	struct posix_acl *i_default_acl;
#endif

2011-01-07 13:49:56 +07:00
	const struct inode_operations *i_op;
	struct super_block *i_sb;
2011-08-07 12:45:50 +07:00
	struct address_space *i_mapping;
2011-01-07 13:49:56 +07:00

2011-06-09 05:18:19 +07:00
#ifdef CONFIG_SECURITY
	void *i_security;
#endif
2011-01-07 13:49:56 +07:00

2011-08-07 12:45:50 +07:00
	/* Stat data, not accessed from path walking */
	unsigned long i_ino;
2011-10-28 19:13:30 +07:00
	/*
	 * Filesystems may only read i_nlink directly. They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
2011-08-07 12:45:50 +07:00
	dev_t i_rdev;
2012-06-04 04:50:19 +07:00
	loff_t i_size;
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
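To make the y2038 problem concrete: a signed 32-bit seconds field tops out at 2147483647, which is 2038-01-19 03:14:07 UTC, while a 64-bit field does not have that limit. The sketch below is a userspace illustration only. The toy_* types and helpers are hypothetical stand-ins and are not the kernel's struct timespec, struct timespec64, or its conversion functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a timespec with 32-bit seconds. */
struct toy_timespec32 {
	int32_t tv_sec;
	long tv_nsec;
};

/* Hypothetical stand-in for a timespec with 64-bit seconds. */
struct toy_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Widening is always lossless. */
static inline struct toy_timespec64 toy_ts32_to_ts64(struct toy_timespec32 ts)
{
	struct toy_timespec64 ret = { .tv_sec = ts.tv_sec, .tv_nsec = ts.tv_nsec };

	return ret;
}

/* Narrowing only works while the seconds value fits in 32 bits. */
static inline bool toy_ts64_fits_ts32(struct toy_timespec64 ts)
{
	return ts.tv_sec >= INT32_MIN && ts.tv_sec <= INT32_MAX;
}

static inline struct toy_timespec32 toy_ts64_to_ts32(struct toy_timespec64 ts)
{
	struct toy_timespec32 ret;

	assert(toy_ts64_fits_ts32(ts));	/* would silently wrap otherwise */
	ret.tv_sec = (int32_t)ts.tv_sec;
	ret.tv_nsec = ts.tv_nsec;
	return ret;
}
```

One second past INT32_MAX is representable in the 64-bit variant but fails the toy_ts64_fits_ts32() check, which is the class of overflow the conversion to 64-bit timestamps avoids.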
The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script could be a little shorter by combining different cases,
but this version was sufficient for my use case.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 09:36:02 +07:00
	struct timespec64 i_atime;
	struct timespec64 i_mtime;
	struct timespec64 i_ctime;
2011-10-29 19:24:18 +07:00
	spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
	unsigned short i_bytes;
2018-07-05 13:25:43 +07:00
	u8 i_blkbits;
	u8 i_write_hint;
2011-08-07 12:45:50 +07:00
	blkcnt_t i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t i_size_seqcount;
#endif

	/* Misc */
	unsigned long i_state;
2016-04-16 02:08:36 +07:00
	struct rw_semaphore i_rwsem;
2011-06-09 05:18:19 +07:00

2011-01-07 13:49:56 +07:00
	unsigned long dirtied_when; /* jiffies of first dirtying */
2015-03-17 23:23:19 +07:00
	unsigned long dirtied_time_when;
2011-01-07 13:49:56 +07:00

2005-04-17 05:20:36 +07:00
	struct hlist_node i_hash;
2015-03-05 02:07:22 +07:00
	struct list_head i_io_list; /* backing dev IO list */
2015-05-23 04:13:37 +07:00
#ifdef CONFIG_CGROUP_WRITEBACK
	struct bdi_writeback *i_wb; /* the associated cgroup wb */
2015-05-29 01:50:51 +07:00

	/* foreign inode detection, see wbc_detach_inode() */
	int i_wb_frn_winner;
	u16 i_wb_frn_avg_time;
	u16 i_wb_frn_history;
2015-05-23 04:13:37 +07:00
#endif
2010-10-21 07:49:30 +07:00
	struct list_head i_lru; /* inode LRU list */
2005-04-17 05:20:36 +07:00
	struct list_head i_sb_list;
2016-07-27 05:21:50 +07:00
	struct list_head i_wb_list; /* backing dev writeback list */
2011-01-07 13:49:49 +07:00
	union {
2012-06-10 00:51:19 +07:00
		struct hlist_head i_dentry;
2011-01-07 13:49:49 +07:00
		struct rcu_head i_rcu;
	};
fs: handle inode->i_version more efficiently
Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.
Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.
When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.
If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.
On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.
This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
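The scheme described above can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: the names `version_counter`, `version_maybe_inc` and `version_query` are hypothetical and are not the kernel's helpers, and the kernel's memory-ordering details are not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: the lowest bit is the "queried" flag, the rest is the counter. */
#define VERSION_QUERIED   1ULL
#define VERSION_INCREMENT 2ULL

struct version_counter {
	_Atomic uint64_t raw;
};

/* Increment only if forced or if the value was queried; return true if it changed. */
static bool version_maybe_inc(struct version_counter *vc, bool force)
{
	uint64_t cur = atomic_load(&vc->raw);

	for (;;) {
		uint64_t next;

		/* Nobody looked since the last increment: nothing to do. */
		if (!force && !(cur & VERSION_QUERIED))
			return false;

		/* Add 2 (one counter step) and clear the flag bit. */
		next = (cur + VERSION_INCREMENT) & ~VERSION_QUERIED;
		if (atomic_compare_exchange_weak(&vc->raw, &cur, next))
			return true;
		/* On failure, cur holds the freshly observed value; retry with it. */
	}
}

/* Return the counter value and make sure the flag bit is set afterwards. */
static uint64_t version_query(struct version_counter *vc)
{
	uint64_t cur = atomic_load(&vc->raw);

	while (!(cur & VERSION_QUERIED)) {
		uint64_t next = cur | VERSION_QUERIED;

		if (atomic_compare_exchange_weak(&vc->raw, &cur, next)) {
			cur = next;
			break;
		}
		/* On failure, cur was refreshed by the CAS; loop re-checks the flag. */
	}
	return cur >> 1;
}
```

In this sketch a writer that calls `version_maybe_inc(vc, false)` twice in a row with no query in between changes the counter at most once, which is the property the commit relies on to avoid dirtying metadata.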
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
2017-12-21 19:45:44 +07:00
	atomic64_t i_version;
2020-03-04 17:28:31 +07:00
	atomic64_t i_sequence; /* see futex */
2012-06-04 04:50:19 +07:00
	atomic_t i_count;
2011-06-25 01:29:43 +07:00
	atomic_t i_dio_count;
2011-10-29 19:24:18 +07:00
	atomic_t i_writecount;
2019-06-07 21:24:38 +07:00
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
2013-12-12 03:20:54 +07:00
	atomic_t i_readcount; /* struct files open RO */
#endif
2019-04-11 01:43:44 +07:00
	union {
		const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
		void (*free_inode)(struct inode *);
	};
2015-01-17 03:05:54 +07:00
	struct file_lock_context *i_flctx;
2005-04-17 05:20:36 +07:00
	struct address_space i_data;
	struct list_head i_devices;
2006-09-27 15:50:47 +07:00
	union {
		struct pipe_inode_info *i_pipe;
2006-09-27 15:50:48 +07:00
		struct block_device *i_bdev;
2006-09-27 15:50:49 +07:00
		struct cdev *i_cdev;
2015-05-02 20:54:06 +07:00
		char *i_link;
parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.
One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not. And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.
So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
* we verify that there's no hashed match
* existing in-lookup match gets hashed by another process
* we verify that there's no in-lookup matches and decide
that everything's fine.
Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup. Then the above would turn into
* read the counter
* do dcache lookup
* if no matches found, check for in-lookup matches
* if there had been none of those either, check if the
counter has changed; repeat if it has.
The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.
We could use the parent's ->d_lock, and it would be less painful in
terms of contention, but for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...
Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.
This commit adds the counter and the update logic; the readers will be
added in the next commit.
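A minimal userspace sketch of the even/odd counter described above, using C11 atomics in place of the kernel's cmpxchg() and smp_store_release(). The names `dir_seq`, `dir_seq_begin`, `dir_seq_end`, `dir_seq_read` and `dir_seq_changed` are hypothetical, and the reader helpers only illustrate the "read the counter ... check if the counter has changed" steps; they are not the readers added by the follow-up commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-directory counter. */
struct dir_seq {
	atomic_uint seq;	/* odd while an update is in progress */
};

/* Updater entry: loop until we move the counter from an even value to odd. */
static unsigned dir_seq_begin(struct dir_seq *d)
{
	for (;;) {
		unsigned n = atomic_load_explicit(&d->seq, memory_order_relaxed);

		/* Only attempt the swap when no other updater holds the counter. */
		if (!(n & 1) &&
		    atomic_compare_exchange_weak_explicit(&d->seq, &n, n + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return n;
	}
}

/* Updater exit: publish the next even value with release semantics. */
static void dir_seq_end(struct dir_seq *d, unsigned n)
{
	assert(!(n & 1));	/* n is the even value returned by dir_seq_begin() */
	atomic_store_explicit(&d->seq, n + 2, memory_order_release);
}

/* Reader side: sample the counter before the lookups ... */
static unsigned dir_seq_read(struct dir_seq *d)
{
	return atomic_load_explicit(&d->seq, memory_order_acquire);
}

/* ... and check afterwards whether an update happened in between. */
static bool dir_seq_changed(struct dir_seq *d, unsigned seen)
{
	return atomic_load_explicit(&d->seq, memory_order_acquire) != seen;
}
```

Because the counter itself serializes updaters, no separate spinlock is needed, which is the point the commit message makes about the lack of spare space in the inode.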
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-15 11:58:55 +07:00
		unsigned i_dir_seq;
2006-09-27 15:50:47 +07:00
	};
2005-04-17 05:20:36 +07:00

	__u32 i_generation;

2009-05-22 04:01:26 +07:00
#ifdef CONFIG_FSNOTIFY
	__u32 i_fsnotify_mask; /* all events this inode cares about */
2017-02-01 15:21:58 +07:00
	struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd for each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-13 04:06:03 +07:00
#endif

2018-12-12 16:50:12 +07:00
#ifdef CONFIG_FS_ENCRYPTION
2015-05-16 06:26:10 +07:00
	struct fscrypt_info *i_crypt_info;
#endif

2019-07-22 23:26:21 +07:00
#ifdef CONFIG_FS_VERITY
	struct fsverity_info *i_verity_info;
#endif

2006-09-27 15:50:46 +07:00
	void *i_private; /* fs or device private pointer */
2016-10-28 15:22:25 +07:00
} __randomize_layout;
2005-04-17 05:20:36 +07:00

2018-01-22 09:04:25 +07:00
struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);

2017-02-28 05:28:32 +07:00
static inline unsigned int i_blocksize(const struct inode *node)
{
	return (1 << node->i_blkbits);
}

2010-10-24 02:19:20 +07:00
static inline int inode_unhashed(struct inode *inode)
{
	return hlist_unhashed(&inode->i_hash);
}

2018-06-30 06:36:57 +07:00
/*
 * __mark_inode_dirty expects inodes to be hashed. Since we don't
 * want special inodes in the fileset inode space, we make them
 * appear hashed, but do not put on any lists. hlist_del()
 * will work fine and require no locking.
 */
static inline void inode_fake_hash(struct inode *inode)
{
	hlist_add_fake(&inode->i_hash);
}

2006-07-03 14:25:05 +07:00
/*
 * inode->i_mutex nesting subclasses for the lock validator:
 *
 * 0: the object of the current VFS operation
 * 1: parent
 * 2: child/target
2012-04-19 02:21:34 +07:00
 * 3: xattr
 * 4: second non-directory
2014-10-27 21:42:01 +07:00
 * 5: second parent (when locking independent directories in rename)
 *
 * I_MUTEX_NONDIR2 is for certain operations (such as rename) which lock two
2012-04-19 02:21:34 +07:00
 * non-directories at once.
2006-07-03 14:25:05 +07:00
 *
 * The locking order between these classes is
2014-10-27 21:42:01 +07:00
 * parent[2] -> child -> grandchild -> normal -> xattr -> second non-directory
2006-07-03 14:25:05 +07:00
 */
enum inode_i_mutex_lock_class
{
	I_MUTEX_NORMAL,
	I_MUTEX_PARENT,
	I_MUTEX_CHILD,
2006-08-27 15:23:56 +07:00
	I_MUTEX_XATTR,
2014-10-27 21:42:01 +07:00
	I_MUTEX_NONDIR2,
	I_MUTEX_PARENT2,
2006-07-03 14:25:05 +07:00
};

2016-01-23 03:40:57 +07:00
static inline void inode_lock(struct inode *inode)
{
2016-04-16 02:08:36 +07:00
	down_write(&inode->i_rwsem);
2016-01-23 03:40:57 +07:00
}

static inline void inode_unlock(struct inode *inode)
{
2016-04-16 02:08:36 +07:00
	up_write(&inode->i_rwsem);
}

static inline void inode_lock_shared(struct inode *inode)
{
	down_read(&inode->i_rwsem);
}

static inline void inode_unlock_shared(struct inode *inode)
{
	up_read(&inode->i_rwsem);
2016-01-23 03:40:57 +07:00
}

static inline int inode_trylock(struct inode *inode)
{
2016-04-16 02:08:36 +07:00
	return down_write_trylock(&inode->i_rwsem);
}

static inline int inode_trylock_shared(struct inode *inode)
{
	return down_read_trylock(&inode->i_rwsem);
2016-01-23 03:40:57 +07:00
}

static inline int inode_is_locked(struct inode *inode)
{
2016-04-16 02:08:36 +07:00
	return rwsem_is_locked(&inode->i_rwsem);
2016-01-23 03:40:57 +07:00
}

static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
{
2016-04-16 02:08:36 +07:00
	down_write_nested(&inode->i_rwsem, subclass);
2016-01-23 03:40:57 +07:00
}

2018-01-19 05:07:53 +07:00
static inline void inode_lock_shared_nested(struct inode *inode, unsigned subclass)
{
	down_read_nested(&inode->i_rwsem, subclass);
}

2012-04-19 02:16:33 +07:00
void lock_two_nondirectories(struct inode *, struct inode*);
void unlock_two_nondirectories(struct inode *, struct inode*);

2005-04-17 05:20:36 +07:00
/*
 * NOTE: in a 32bit arch with a preemptable kernel and
 * an UP compile the i_size_read/write must be atomic
 * with respect to the local cpu (unlike with preempt disabled),
 * but they don't need to be atomic with respect to other cpus like in
 * true SMP (so they need to either locally disable irq around
 * the read or for example on x86 they can be still implemented as a
 * cmpxchg8b without the need of the lock prefix). For SMP compiles
 * and 64bit archs it makes no difference if preempt is enabled or not.
 */
2006-12-07 11:35:37 +07:00
static inline loff_t i_size_read(const struct inode *inode)
2005-04-17 05:20:36 +07:00
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	loff_t i_size;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&inode->i_size_seqcount);
		i_size = inode->i_size;
	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
	return i_size;
2019-10-16 02:18:10 +07:00
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
2005-04-17 05:20:36 +07:00
	loff_t i_size;

	preempt_disable();
	i_size = inode->i_size;
	preempt_enable();
	return i_size;
#else
	return inode->i_size;
#endif
}

2006-10-17 14:10:07 +07:00
/*
 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
 */
2005-04-17 05:20:36 +07:00
static inline void i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
2013-05-01 05:27:27 +07:00
	preempt_disable();
2005-04-17 05:20:36 +07:00
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
2013-05-01 05:27:27 +07:00
	preempt_enable();
2019-10-16 02:18:10 +07:00
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
2005-04-17 05:20:36 +07:00
	preempt_disable();
	inode->i_size = i_size;
	preempt_enable();
#else
	inode->i_size = i_size;
#endif
}

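The 32-bit SMP branches of i_size_read() and i_size_write() follow the usual sequence-counter pattern: the writer makes the counter odd, updates the value, then makes it even again, and the reader retries until it sees the same even value before and after its read. A rough userspace sketch with C11 atomics is below. The type `sized_obj` and the functions `size_write` and `size_read` are hypothetical, the 64-bit size is split into two 32-bit halves purely to imitate a non-atomic 64-bit access, and writers are assumed to be serialized by the caller, as the comment above requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for i_size plus i_size_seqcount. */
struct sized_obj {
	atomic_uint seq;	/* odd while a write is in progress */
	atomic_uint lo, hi;	/* 64-bit size stored as two 32-bit halves */
};

/* Writer: callers must serialize writers externally. */
static void size_write(struct sized_obj *o, uint64_t size)
{
	unsigned s = atomic_load_explicit(&o->seq, memory_order_relaxed);

	assert(!(s & 1));	/* an odd value here would mean a concurrent writer */
	atomic_store_explicit(&o->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&o->lo, (uint32_t)size, memory_order_relaxed);
	atomic_store_explicit(&o->hi, (uint32_t)(size >> 32), memory_order_relaxed);
	atomic_store_explicit(&o->seq, s + 2, memory_order_release);
}

/* Reader: retry until both halves were read under one even sequence value. */
static uint64_t size_read(struct sized_obj *o)
{
	unsigned s0, s1;
	uint32_t lo, hi;

	do {
		s0 = atomic_load_explicit(&o->seq, memory_order_acquire);
		lo = atomic_load_explicit(&o->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&o->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&o->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);

	return ((uint64_t)hi << 32) | lo;
}
```

If two writers ran `size_write()` concurrently, both could store the same odd value and leave the counter odd for good, which is the "spinning forever" failure the comment warns about.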
2006-12-07 11:35:37 +07:00
static inline unsigned iminor(const struct inode *inode)
2005-04-17 05:20:36 +07:00
{
	return MINOR(inode->i_rdev);
}

2006-12-07 11:35:37 +07:00
static inline unsigned imajor(const struct inode *inode)
2005-04-17 05:20:36 +07:00
{
	return MAJOR(inode->i_rdev);
}

struct fown_struct {
	rwlock_t lock; /* protects pid, uid, euid fields */
2006-10-02 16:17:15 +07:00
	struct pid *pid; /* pid or -pgrp where SIGIO should be sent */
	enum pid_type pid_type; /* Kind of process group SIGIO should be sent to */
2012-02-08 22:07:50 +07:00
	kuid_t uid, euid; /* uid/euid of process setting the owner */
2005-04-17 05:20:36 +07:00
	int signum; /* posix.1b rt signal to be delivered on IO */
};

/*
 * Track a single file's readahead state
 */
struct file_ra_state {
2007-10-16 15:24:31 +07:00
	pgoff_t start; /* where readahead started */
	unsigned int size; /* # of readahead pages */
	unsigned int async_size; /* do asynchronous readahead when
2007-07-19 15:48:08 +07:00
		there are only # of pages ahead */
2007-07-19 15:47:59 +07:00

2007-10-16 15:24:31 +07:00
	unsigned int ra_pages; /* Maximum readahead window */
2009-06-17 05:31:19 +07:00
	unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
2007-10-16 15:24:33 +07:00
	loff_t prev_pos; /* Cache last read() position */
2005-04-17 05:20:36 +07:00
};

2007-07-19 15:47:59 +07:00
/*
 * Check if @index falls in the readahead windows.
 */
static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
{
2007-07-19 15:48:08 +07:00
	return (index >= ra->start &&
		index < ra->start + ra->size);
2007-07-19 15:47:59 +07:00
}

2005-04-17 05:20:36 +07:00
struct file {
2005-10-31 06:02:16 +07:00
	union {
2013-07-09 04:24:16 +07:00
		struct llist_node fu_llist;
2005-10-31 06:02:16 +07:00
		struct rcu_head fu_rcuhead;
	} f_u;
2006-12-08 17:36:35 +07:00
	struct path f_path;
2013-03-02 07:48:30 +07:00
	struct inode *f_inode; /* cached value */
2006-03-28 16:56:41 +07:00
	const struct file_operations *f_op;
2011-09-16 06:06:48 +07:00

	/*
2014-03-04 00:36:58 +07:00
	 * Protects f_ep_links, f_flags.
2011-09-16 06:06:48 +07:00
	 * Must not be taken from IRQ context.
	 */
	spinlock_t f_lock;
fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying and setting these flags:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a pointer to a 64-bit value to get/set these hints,
and the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode have a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
* writehint.c: get or set an inode write hint
*/
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>
#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
static char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
int main(int argc, char *argv[])
{
uint64_t hint;
int fd, ret;
if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}
if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
printf("%s: hint %s\n", argv[1], str[hint]);
close(fd);
return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 00:47:04 +07:00
	enum rw_hint f_write_hint;
2008-07-26 11:39:17 +07:00
	atomic_long_t f_count;
2005-04-17 05:20:36 +07:00
	unsigned int f_flags;
2008-09-03 02:28:45 +07:00
	fmode_t f_mode;
2014-03-04 00:36:58 +07:00
	struct mutex f_pos_lock;
2005-04-17 05:20:36 +07:00
	loff_t f_pos;
	struct fown_struct f_owner;
2008-11-14 06:39:25 +07:00
	const struct cred *f_cred;
2005-04-17 05:20:36 +07:00
	struct file_ra_state f_ra;

2007-10-17 13:27:21 +07:00
	u64 f_version;
[PATCH] fs.h: ifdef security fields
[assuming BSD security levels are deleted]
The only user of i_security, f_security, s_security fields is SELinux,
however, quite a few security modules are trying to get into kernel.
So, wrap them under CONFIG_SECURITY. Adding config option for each
security field is likely an overkill.
Following Stephen Smalley's suggestion, i_security initialization is
moved to security_inode_alloc() to not clutter core code with ifdefs
and make alloc_inode() codepath tiny little bit smaller and faster.
The user of (highly greppable) struct fown_struct::security field is
still to be found. I've checked every "fown_struct" and every "f_owner"
occurrence. Additionally, its removal doesn't break i386 allmodconfig
build.
struct inode, struct file, struct super_block, struct fown_struct
become smaller.
P.S. Combined with two reiserfs inode shrinking patches sent to
linux-fsdevel, I can finally suck 12 reiserfs inodes into one page.
/proc/slabinfo
-ext2_inode_cache 388 10
+ext2_inode_cache 384 10
-inode_cache 280 14
+inode_cache 276 14
-proc_inode_cache 296 13
+proc_inode_cache 292 13
-reiser_inode_cache 336 11
+reiser_inode_cache 332 12 <=
-shmem_inode_cache 372 10
+shmem_inode_cache 368 10
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 16:00:01 +07:00
#ifdef CONFIG_SECURITY
2005-04-17 05:20:36 +07:00
	void *f_security;
2006-09-29 16:00:01 +07:00
#endif
2005-04-17 05:20:36 +07:00
	/* needed for tty driver, and maybe others */
	void *private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head f_ep_links;
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
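The per-length limits quoted above amount to a small table lookup with a running tally. The sketch below only illustrates that bookkeeping; the names `path_limits`, `path_count`, `path_count_reset` and `path_count_add` are hypothetical, and the graph walk that would discover the paths is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Limits quoted in the text: index 0 is path length 1, index 4 is length 5. */
static const int path_limits[] = { 1000, 500, 100, 50, 10 };
#define MAX_PATH_LEN (sizeof(path_limits) / sizeof(path_limits[0]))

/* Hypothetical tally of wakeup paths seen at each length during one check. */
static int path_count[MAX_PATH_LEN];

/* Clear the tally before checking the paths reachable from a new link. */
static void path_count_reset(void)
{
	assert(MAX_PATH_LEN == 5);	/* the text lists limits for lengths 1..5 */
	for (size_t i = 0; i < MAX_PATH_LEN; i++)
		path_count[i] = 0;
}

/* Record one wakeup path of the given length; -1 means too many or too long. */
static int path_count_add(size_t len)
{
	if (len < 1 || len > MAX_PATH_LEN)
		return -1;
	if (++path_count[len - 1] > path_limits[len - 1])
		return -1;
	return 0;
}
```

With these numbers, an eleventh path of length 5 is rejected, as is any path of length 6, while the common case of many length-1 paths stays well inside the limit of 1000.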
In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'. In this case, each 'source file descriptor' has 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not cause any workloads that currently work to
fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13 08:17:43 +07:00
	struct list_head f_tfile_llink;
2005-04-17 05:20:36 +07:00
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space *f_mapping;
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
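The "since" cursor idea can be illustrated with a deliberately simplified model. This is not the real errseq_t encoding (which packs the error and a counter into one atomically updated word); the types `wb_err` and `wb_cursor` and the three functions are hypothetical, and the code is not thread-safe.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the mapping's error state. */
struct wb_err {
	unsigned seq;	/* bumped every time an error is recorded */
	int err;	/* most recent negative errno, or 0 */
};

/* Hypothetical stand-in for the per-file cursor (the "since" value). */
struct wb_cursor {
	unsigned since;
};

/* Record a writeback error on the mapping. */
static void wb_err_set(struct wb_err *m, int err)
{
	assert(err < 0);
	m->err = err;
	m->seq++;
}

/* Called at open time: start watching from the current point. */
static void wb_cursor_sample(const struct wb_err *m, struct wb_cursor *c)
{
	c->since = m->seq;
}

/* Called at fsync time: report an error once per cursor, then advance. */
static int wb_check_and_advance(const struct wb_err *m, struct wb_cursor *c)
{
	if (c->since == m->seq)
		return 0;
	c->since = m->seq;
	return m->err;
}
```

In this model two files opened before a failed writeback each see the error exactly once on their next check, instead of the first fsync caller consuming it for everyone.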
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 18:02:25 +07:00
	errseq_t f_wb_err;
vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
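
The usage pattern suggested above could look roughly like the following userspace sketch. The helper name `flush_and_locate` and its return convention are assumptions made for illustration. Whether syncfs() actually reports per-inode writeback errors depends on running a kernel that includes this change.

```c
#define _GNU_SOURCE		/* for syncfs() in glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the filesystem that fds[0] lives on with one
 * syncfs() call. Only if that fails, fsync() every descriptor to find out
 * which ones report an error.
 *
 * On return, failed[i] holds the errno from fsync(fds[i]), or 0.
 * Returns 0 if syncfs() succeeded (or n == 0), -1 otherwise.
 */
static int flush_and_locate(const int *fds, size_t n, int *failed)
{
	size_t i;

	assert(n == 0 || (fds != NULL && failed != NULL));

	for (i = 0; i < n; i++)
		failed[i] = 0;
	if (n == 0)
		return 0;

	/* Cheap path: one call covers every file on the filesystem. */
	if (syncfs(fds[0]) == 0)
		return 0;

	/* Slow path: ask each descriptor individually. */
	for (i = 0; i < n; i++) {
		if (fsync(fds[i]) != 0)
			failed[i] = errno;
	}
	return -1;
}
```

The sketch assumes all descriptors refer to files on the same filesystem; descriptors on other filesystems would need their own syncfs() call.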
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 11:45:36 +07:00
|
|
|
errseq_t f_sb_err; /* for syncfs */
|
2016-10-28 15:22:25 +07:00
|
|
|
} __randomize_layout
|
|
|
|
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-01-29 20:13:26 +07:00
|
|
|
struct file_handle {
|
|
|
|
__u32 handle_bytes;
|
|
|
|
int handle_type;
|
|
|
|
/* file identifier */
|
2020-05-04 23:16:37 +07:00
|
|
|
unsigned char f_handle[];
|
2011-01-29 20:13:26 +07:00
|
|
|
};
|
|
|
|
|
2012-08-28 01:48:26 +07:00
|
|
|
static inline struct file *get_file(struct file *f)
|
|
|
|
{
|
|
|
|
atomic_long_inc(&f->f_count);
|
|
|
|
return f;
|
|
|
|
}
|
2018-11-22 00:32:39 +07:00
|
|
|
#define get_file_rcu_many(x, cnt) \
|
|
|
|
atomic_long_add_unless(&(x)->f_count, (cnt), 0)
|
|
|
|
#define get_file_rcu(x) get_file_rcu_many((x), 1)
|
2008-07-26 11:39:17 +07:00
|
|
|
#define file_count(x) atomic_long_read(&(x)->f_count)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#define MAX_NON_LFS ((1UL<<31) - 1)
|
|
|
|
|
|
|
|
/* Page cache limit. The filesystems should put that into their s_maxbytes
|
|
|
|
limits, otherwise bad things can happen in VM. */
|
|
|
|
#if BITS_PER_LONG==32
|
Clarify (and fix) MAX_LFS_FILESIZE macros
We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
filesystems (and other IO targets) that know they are 64-bit clean and
don't have any 32-bit limits in their IO path.
It turns out that our 32-bit value for that limit was bogus. On 32-bit,
the VM layer is limited by the page cache to only 32-bit index values,
but our logic for that was confusing and actually wrong. We used to
define that value to
(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
which is actually odd in several ways: it limits the index to 31 bits,
and then it limits files so that they can't have data in that last byte
of a page that has the highest 31-bit index (ie page index 0x7fffffff).
Neither of those limitations makes sense. The index is actually the full
32 bit unsigned value, and we can use that whole full page. So the
maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".
However, we do want to avoid the maximum index, because we have code
that iterates over the page indexes, and we don't want that code to
overflow. So the maximum size of a file on a 32-bit host should
actually be one page less than the full 32-bit index.
So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will
not actually be using the page of that last index (ULONG_MAX), but we
can grow a file up to that limit.
The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
byte less), but the actual true VM limit is one page less than 16TiB.
This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop
in truncate_inode_pages_range()"), which started applying that
MAX_LFS_FILESIZE limit to block devices too.
NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
actually just the offset type itself (loff_t), which is signed. But for
clarity, on 64-bit, just use the maximum signed value, and don't make
people have to count the number of 'f' characters in the hex constant.
So just use LLONG_MAX for the 64-bit case. That was what the value had
been before too, just written out as a hex constant.
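
The 8TiB and 16TiB figures quoted above can be checked with a small standalone calculation. This is a sketch, not kernel code: the function names are made up, a 32-bit unsigned long is modelled with UINT32_MAX, and the 4KiB page size used in the usage note is an assumption (the message does not state the page size).

```c
#include <assert.h>
#include <stdint.h>

/* Old 32-bit definition: ((loff_t)PAGE_SIZE << (BITS_PER_LONG-1)) - 1 */
static int64_t old_max_lfs_filesize(unsigned int page_shift)
{
	assert(page_shift < 31);	/* keep every shift inside int64_t */
	return (((int64_t)1 << page_shift) << 31) - 1;
}

/* New 32-bit definition: (loff_t)ULONG_MAX << PAGE_SHIFT */
static int64_t new_max_lfs_filesize(unsigned int page_shift)
{
	assert(page_shift < 31);
	return (int64_t)UINT32_MAX << page_shift;
}
```

With `page_shift` set to 12, the old limit is 2^43 - 1 bytes (one byte less than 8TiB) and the new limit is 2^44 - 4096 bytes (one page less than 16TiB), matching the numbers in the message.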
Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 02:12:25 +07:00
|
|
|
#define MAX_LFS_FILESIZE ((loff_t)ULONG_MAX << PAGE_SHIFT)
|
2005-04-17 05:20:36 +07:00
|
|
|
#elif BITS_PER_LONG==64
|
2017-08-28 02:12:25 +07:00
|
|
|
#define MAX_LFS_FILESIZE ((loff_t)LLONG_MAX)
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#define FL_POSIX 1
|
|
|
|
#define FL_FLOCK 2
|
2011-07-02 02:18:34 +07:00
|
|
|
#define FL_DELEG 4 /* NFSv4 delegation */
|
2005-04-17 05:20:36 +07:00
|
|
|
#define FL_ACCESS 8 /* not trying to lock, just looking */
|
2006-06-30 03:38:32 +07:00
|
|
|
#define FL_EXISTS 16 /* when unlocking, test for existence */
|
2005-04-17 05:20:36 +07:00
|
|
|
#define FL_LEASE 32 /* lease held on this file */
|
2006-06-23 16:05:12 +07:00
|
|
|
#define FL_CLOSE 64 /* unlock on close */
|
2005-04-17 05:20:36 +07:00
|
|
|
#define FL_SLEEP 128 /* A blocking lock */
|
2011-07-27 05:25:49 +07:00
|
|
|
#define FL_DOWNGRADE_PENDING 256 /* Lease is being downgraded */
|
|
|
|
#define FL_UNLOCK_PENDING 512 /* Lease is being broken */
|
2014-04-22 19:24:32 +07:00
|
|
|
#define FL_OFDLCK 1024 /* lock is "owned" by struct file */
|
2015-01-22 01:17:03 +07:00
|
|
|
#define FL_LAYOUT 2048 /* outstanding pNFS layout */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-04-11 23:50:09 +07:00
|
|
|
#define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
|
|
|
|
|
2008-07-25 15:48:57 +07:00
|
|
|
/*
|
|
|
|
* Special return value from posix_lock_file() and vfs_lock_file() for
|
|
|
|
* asynchronous locking.
|
|
|
|
*/
|
|
|
|
#define FILE_LOCK_DEFERRED 1
|
|
|
|
|
2014-09-02 06:04:48 +07:00
|
|
|
/* legacy typedef, should eventually be removed */
|
2014-07-13 22:00:37 +07:00
|
|
|
typedef void *fl_owner_t;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-01-17 03:05:56 +07:00
|
|
|
struct file_lock;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct file_lock_operations {
|
|
|
|
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
|
|
|
|
void (*fl_release_private)(struct file_lock *);
|
|
|
|
};
|
|
|
|
|
|
|
|
struct lock_manager_operations {
|
2015-04-03 20:04:04 +07:00
|
|
|
fl_owner_t (*lm_get_owner)(fl_owner_t);
|
|
|
|
void (*lm_put_owner)(fl_owner_t);
|
2011-07-21 07:21:59 +07:00
|
|
|
void (*lm_notify)(struct file_lock *); /* unblock callback */
|
2014-08-22 21:18:42 +07:00
|
|
|
int (*lm_grant)(struct file_lock *, int);
|
2014-09-02 02:06:54 +07:00
|
|
|
bool (*lm_break)(struct file_lock *);
|
2015-01-17 03:05:57 +07:00
|
|
|
int (*lm_change)(struct file_lock *, int, struct list_head *);
|
2014-08-22 21:55:47 +07:00
|
|
|
void (*lm_setup)(struct file_lock *, void **);
|
2017-07-29 03:35:15 +07:00
|
|
|
bool (*lm_breaker_owns_lease)(struct file_lock *);
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2007-09-06 23:34:25 +07:00
|
|
|
struct lock_manager {
|
|
|
|
struct list_head list;
|
2015-08-06 23:47:02 +07:00
|
|
|
/*
|
|
|
|
* NFSv4 and up also want opens blocked during the grace period;
|
|
|
|
* NLM doesn't care:
|
|
|
|
*/
|
|
|
|
bool block_opens;
|
2007-09-06 23:34:25 +07:00
|
|
|
};
|
|
|
|
|
2012-07-25 19:57:22 +07:00
|
|
|
struct net;
|
|
|
|
void locks_start_grace(struct net *, struct lock_manager *);
|
2007-09-06 23:34:25 +07:00
|
|
|
void locks_end_grace(struct lock_manager *);
|
2017-09-26 14:14:07 +07:00
|
|
|
bool locks_in_grace(struct net *);
|
|
|
|
bool opens_in_grace(struct net *);
|
2007-09-06 23:34:25 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* that will die - we need it for nfs_lock_info */
|
|
|
|
#include <linux/nfs_fs_i.h>
|
|
|
|
|
2013-06-21 19:58:12 +07:00
|
|
|
/*
|
|
|
|
* struct file_lock represents a generic "file lock". It's used to represent
|
|
|
|
* POSIX byte range locks, BSD (flock) locks, and leases. It's important to
|
|
|
|
* note that the same struct is used to represent both a request for a lock and
|
|
|
|
* the lock itself, but the same object is never used for both.
|
|
|
|
*
|
|
|
|
* FIXME: should we create a separate "struct lock_request" to help distinguish
|
|
|
|
* these two uses?
|
|
|
|
*
|
2015-01-22 08:44:01 +07:00
|
|
|
* The various i_flctx lists are ordered by:
|
2013-06-21 19:58:12 +07:00
|
|
|
*
|
2015-01-22 08:44:01 +07:00
|
|
|
* 1) lock owner
|
|
|
|
* 2) lock range start
|
|
|
|
* 3) lock range end
|
2013-06-21 19:58:12 +07:00
|
|
|
*
|
|
|
|
* Obviously, the last two criteria only matter for POSIX locks.
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
struct file_lock {
|
2018-11-30 06:04:08 +07:00
|
|
|
struct file_lock *fl_blocker; /* The lock, that is blocking us */
|
2015-01-17 03:05:54 +07:00
|
|
|
struct list_head fl_list; /* link into file_lock_context */
|
2013-06-21 19:58:17 +07:00
|
|
|
struct hlist_node fl_link; /* node in global lists */
|
2018-11-30 06:04:08 +07:00
|
|
|
struct list_head fl_blocked_requests; /* list of requests with
|
|
|
|
* ->fl_blocker pointing here
|
|
|
|
*/
|
|
|
|
struct list_head fl_blocked_member; /* node in
|
|
|
|
* ->fl_blocker->fl_blocked_requests
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
fl_owner_t fl_owner;
|
2011-07-27 03:28:29 +07:00
|
|
|
unsigned int fl_flags;
|
2008-07-12 07:20:49 +07:00
|
|
|
unsigned char fl_type;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned int fl_pid;
|
2013-06-21 19:58:22 +07:00
|
|
|
int fl_link_cpu; /* what cpu's list is this on? */
|
2005-04-17 05:20:36 +07:00
|
|
|
wait_queue_head_t fl_wait;
|
|
|
|
struct file *fl_file;
|
|
|
|
loff_t fl_start;
|
|
|
|
loff_t fl_end;
|
|
|
|
|
|
|
|
struct fasync_struct * fl_fasync; /* for lease break notifications */
|
2011-07-27 05:25:49 +07:00
|
|
|
/* for lease breaks: */
|
|
|
|
unsigned long fl_break_time;
|
|
|
|
unsigned long fl_downgrade_time;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-09-22 07:01:11 +07:00
|
|
|
const struct file_lock_operations *fl_ops; /* Callbacks for filesystems */
|
2009-09-22 07:01:12 +07:00
|
|
|
const struct lock_manager_operations *fl_lmops; /* Callbacks for lockmanagers */
|
2005-04-17 05:20:36 +07:00
|
|
|
union {
|
|
|
|
struct nfs_lock_info nfs_fl;
|
2005-06-23 00:16:32 +07:00
|
|
|
struct nfs4_lock_info nfs4_fl;
|
2007-07-16 13:40:12 +07:00
|
|
|
struct {
|
|
|
|
struct list_head link; /* link in AFS vnode's pending_locks list */
|
|
|
|
int state; /* state of grant or error if -ve */
|
2019-04-25 20:26:50 +07:00
|
|
|
unsigned int debug_id;
|
2007-07-16 13:40:12 +07:00
|
|
|
} afs;
|
2005-04-17 05:20:36 +07:00
|
|
|
} fl_u;
|
2016-10-28 15:22:25 +07:00
|
|
|
} __randomize_layout;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-01-17 03:05:54 +07:00
|
|
|
struct file_lock_context {
|
2015-01-17 03:05:57 +07:00
|
|
|
spinlock_t flc_lock;
|
2015-01-17 03:05:54 +07:00
|
|
|
struct list_head flc_flock;
|
2015-01-17 03:05:55 +07:00
|
|
|
struct list_head flc_posix;
|
2015-01-17 03:05:55 +07:00
|
|
|
struct list_head flc_lease;
|
2015-01-17 03:05:54 +07:00
|
|
|
};
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* The following constant reflects the upper bound of the file/locking space */
|
|
|
|
#ifndef OFFSET_MAX
|
|
|
|
#define INT_LIMIT(x) (~((x)1 << (sizeof(x)*8 - 1)))
|
|
|
|
#define OFFSET_MAX INT_LIMIT(loff_t)
|
|
|
|
#define OFFT_OFFSET_MAX INT_LIMIT(off_t)
|
|
|
|
#endif
|
|
|
|
|
2008-08-06 20:12:22 +07:00
|
|
|
extern void send_sigio(struct fown_struct *fown, int fd, int band);
|
|
|
|
|
2018-07-18 20:44:43 +07:00
|
|
|
#define locks_inode(f) file_inode(f)
|
2016-09-16 17:44:20 +07:00
|
|
|
|
2008-08-06 20:12:22 +07:00
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2017-05-27 17:07:19 +07:00
|
|
|
extern int fcntl_getlk(struct file *, unsigned int, struct flock *);
|
[PATCH] stale POSIX lock handling
I believe that there is a problem with the handling of POSIX locks, which
the attached patch should address.
The problem appears to be a race between fcntl(2) and close(2). A
multithreaded application could close a file descriptor at the same time as
it is trying to acquire a lock using the same file descriptor. I would
suggest that that multithreaded application is not providing the proper
synchronization for itself, but the OS should still behave correctly.
SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
a file descriptor is closed, all POSIX locks on the file owned by the
process which closed the file descriptor should be released.
The trick here is when those locks are released. The current code releases
all locks which exist when close is processing, but any locks in progress
are handled when the last reference to the open file is released.
There are three cases to consider.
One is the simple case, a multithreaded (mt) process has a file open and
races to close it and acquire a lock on it. In this case, the close will
release one reference to the open file and when the fcntl is done, it will
release the other reference. For this situation, no locks should exist on
the file when both the close and fcntl operations are done. The current
system will handle this case because the last reference to the open file is
being released.
The second case is when the mt process has dup(2)'d the file descriptor.
The close will release one reference to the file and the fcntl, when done,
will release another, but there will still be at least one more reference
to the open file. One could argue that the existence of a lock on the file
after the close has completed is okay, because it was acquired after the
close operation and there is still a way for the application to release the
lock on the file, using an existing file descriptor.
The third case is when the mt process has forked, after opening the file
and either before or after becoming an mt process. In this case, each
process would hold a reference to the open file. For each process, this
degenerates to the first case above. However, the lock continues to exist
until both processes have released their references to the open file. This
lock could block other lock requests.
The changes to release the lock when the last reference to the open file
is released aren't quite right, because they would allow the lock to exist
as long as there was a reference to the open file. This is too long.
The new proposed solution is to add support in the fcntl code path to
detect a race with close and then to release the lock which was just
acquired when such a race is detected. This causes locks to be released
in a timely fashion and lets the system conform to the POSIX semantic
specification.
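
The POSIX rule that this fix relies on (closing any descriptor for a file drops the process's POSIX locks on that file) can be observed from userspace. The sketch below does not reproduce the fcntl/close race itself; it only demonstrates the baseline rule. The helper names are hypothetical, and a forked child is used as the observer because F_GETLK does not report locks held by the calling process.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a write lock covering the first byte of the file. */
static struct flock first_byte_wrlock(void)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 1,
	};
	return fl;
}

/*
 * From a forked child, test whether a write lock on the first byte would
 * conflict with an existing lock. Returns 1 if locked, 0 if not, -1 on error.
 */
static int locked_as_seen_by_child(const char *path)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct flock fl = first_byte_wrlock();
		int fd = open(path, O_RDWR);

		if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
			_exit(2);
		_exit(fl.l_type != F_UNLCK);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) > 1)
		return -1;
	return WEXITSTATUS(status);
}

/*
 * Lock through one descriptor, then close a second, unrelated descriptor
 * for the same file. *before and *after receive the child's observations.
 */
static int close_drops_posix_lock(const char *path, int *before, int *after)
{
	struct flock fl = first_byte_wrlock();
	int fd1, fd2;

	assert(path != NULL && before != NULL && after != NULL);

	fd1 = open(path, O_RDWR | O_CREAT, 0600);
	if (fd1 < 0)
		return -1;
	fd2 = open(path, O_RDWR);
	if (fd2 < 0 || fcntl(fd1, F_SETLK, &fl) < 0) {
		if (fd2 >= 0)
			close(fd2);
		close(fd1);
		return -1;
	}

	*before = locked_as_seen_by_child(path);
	close(fd2);	/* releases the lock taken through fd1 */
	*after = locked_as_seen_by_child(path);
	close(fd1);
	return 0;
}
```

On a filesystem with ordinary POSIX lock support, the child is expected to see the lock before `fd2` is closed and no lock afterwards, even though the lock was taken through `fd1`.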
This was tested by instrumenting a kernel to detect dangling locks and
then running a program which generates case #3 above. A dangling lock
could be reliably generated. When the changes to detect the close/fcntl
race were added, a dangling lock could no longer be generated.
Cc: Matthew Wilcox <willy@debian.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28 01:45:09 +07:00
|
|
|
extern int fcntl_setlk(unsigned int, struct file *, unsigned int,
|
2017-05-27 17:07:19 +07:00
|
|
|
struct flock *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#if BITS_PER_LONG == 32
|
2017-05-27 17:07:19 +07:00
|
|
|
extern int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
|
2005-07-28 01:45:09 +07:00
|
|
|
extern int fcntl_setlk64(unsigned int, struct file *, unsigned int,
|
2017-05-27 17:07:19 +07:00
|
|
|
struct flock64 *);
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|
|
|
|
|
|
|
|
extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
|
|
|
|
extern int fcntl_getlease(struct file *filp);
|
|
|
|
|
|
|
|
/* fs/locks.c */
|
2016-01-08 03:08:51 +07:00
|
|
|
void locks_free_lock_context(struct inode *inode);
|
2010-10-31 04:31:15 +07:00
|
|
|
void locks_free_lock(struct file_lock *fl);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void locks_init_lock(struct file_lock *);
|
2010-10-27 20:46:08 +07:00
|
|
|
extern struct file_lock * locks_alloc_lock(void);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void locks_copy_lock(struct file_lock *, struct file_lock *);
|
2014-08-22 21:18:42 +07:00
|
|
|
extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void locks_remove_posix(struct file *, fl_owner_t);
|
2014-02-04 00:13:08 +07:00
|
|
|
extern void locks_remove_file(struct file *);
|
2009-04-01 03:12:56 +07:00
|
|
|
extern void locks_release_private(struct file_lock *);
|
2007-05-12 03:09:32 +07:00
|
|
|
extern void posix_test_lock(struct file *, struct file_lock *);
|
2007-01-19 04:15:35 +07:00
|
|
|
extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
|
2018-11-30 06:04:08 +07:00
|
|
|
extern int locks_delete_block(struct file_lock *);
|
2007-02-21 12:58:50 +07:00
|
|
|
extern int vfs_test_lock(struct file *, struct file_lock *);
|
2007-01-19 04:15:35 +07:00
|
|
|
extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
|
2007-01-19 05:52:58 +07:00
|
|
|
extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
|
2015-10-23 00:38:13 +07:00
|
|
|
extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl);
|
2012-03-06 01:18:59 +07:00
|
|
|
extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type);
|
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
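
As background for the statement above: on 32-bit architectures the seconds field of struct timespec is a 32-bit signed value, so it cannot represent times past early 2038. The following standalone sketch (helper names are hypothetical) shows where that boundary lies and why a 64-bit seconds field avoids it.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Would this timestamp fit in a 32-bit signed seconds field? */
static int fits_in_32bit_seconds(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* UTC calendar year of the last second a 32-bit field can hold, or -1. */
static int last_32bit_year(void)
{
	time_t t = (time_t)INT32_MAX;
	struct tm tm;

	assert(sizeof(time_t) >= sizeof(int32_t));
	if (gmtime_r(&t, &tm) == NULL)
		return -1;
	return tm.tm_year + 1900;
}
```

`last_32bit_year()` yields 2038, and `fits_in_32bit_seconds()` rejects the very next second, which a 64-bit seconds field such as the one in struct timespec64 can still hold.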
The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script could be a little shorter if different cases were combined,
but this version was sufficient for my use case.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
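The conversions that the script above inserts (timespec64_to_timespec() and timespec_to_timespec64()) can be illustrated with a small userspace sketch. The demo_* types and helpers are hypothetical stand-ins, not the kernel definitions, and the sketch assumes a 32-bit tv_sec, as on 32-bit architectures. Widening is always lossless; narrowing is only safe while the seconds value still fits in 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; the kernel types differ in detail. */
struct demo_timespec32 {
	int32_t tv_sec;   /* cannot represent times after 2038-01-19 03:14:07 UTC */
	long    tv_nsec;
};

struct demo_timespec64 {
	int64_t tv_sec;   /* wide enough for dates far beyond 2038 */
	long    tv_nsec;
};

/* Widening never loses information. */
static struct demo_timespec64 demo_to_timespec64(struct demo_timespec32 ts)
{
	struct demo_timespec64 ret = { .tv_sec = ts.tv_sec, .tv_nsec = ts.tv_nsec };
	return ret;
}

/* True when the seconds value can be stored in the 32-bit variant. */
static bool demo_fits_timespec32(struct demo_timespec64 ts)
{
	return ts.tv_sec >= INT32_MIN && ts.tv_sec <= INT32_MAX;
}

/* Narrow back; callers must check demo_fits_timespec32() first. */
static struct demo_timespec32 demo_to_timespec32(struct demo_timespec64 ts)
{
	assert(demo_fits_timespec32(ts));
	struct demo_timespec32 ret = { .tv_sec = (int32_t)ts.tv_sec, .tv_nsec = ts.tv_nsec };
	return ret;
}
```

Keeping the narrowing step explicit at each call site, as the script does, makes the remaining non-y2038-safe users easy to find later.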
2018-05-09 09:36:02 +07:00
|
|
|
extern void lease_get_mtime(struct inode *, struct timespec64 *time);
|
2014-08-22 21:40:25 +07:00
|
|
|
extern int generic_setlease(struct file *, long, struct file_lock **, void **priv);
|
|
|
|
extern int vfs_setlease(struct file *, long, struct file_lock **, void **);
|
2015-01-17 03:05:57 +07:00
|
|
|
extern int lease_modify(struct file_lock *, int, struct list_head *);
|
2019-08-19 01:18:45 +07:00
|
|
|
|
|
|
|
struct notifier_block;
|
|
|
|
extern int lease_register_notifier(struct notifier_block *);
|
|
|
|
extern void lease_unregister_notifier(struct notifier_block *);
|
|
|
|
|
2015-04-17 02:49:38 +07:00
|
|
|
struct files_struct;
|
|
|
|
extern void show_fd_locks(struct seq_file *f,
|
|
|
|
struct file *filp, struct files_struct *files);
|
2008-08-06 20:12:22 +07:00
|
|
|
#else /* !CONFIG_FILE_LOCKING */
|
2014-02-04 00:13:09 +07:00
|
|
|
static inline int fcntl_getlk(struct file *file, unsigned int cmd,
|
|
|
|
struct flock __user *user)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int fcntl_setlk(unsigned int fd, struct file *file,
|
|
|
|
unsigned int cmd, struct flock __user *user)
|
|
|
|
{
|
|
|
|
return -EACCES;
|
|
|
|
}
|
|
|
|
|
2008-08-06 20:12:22 +07:00
|
|
|
#if BITS_PER_LONG == 32
|
2014-02-04 00:13:09 +07:00
|
|
|
static inline int fcntl_getlk64(struct file *file, unsigned int cmd,
|
|
|
|
struct flock64 __user *user)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int fcntl_setlk64(unsigned int fd, struct file *file,
|
|
|
|
unsigned int cmd, struct flock64 __user *user)
|
|
|
|
{
|
|
|
|
return -EACCES;
|
|
|
|
}
|
2008-08-06 20:12:22 +07:00
|
|
|
#endif
|
2009-01-20 17:29:45 +07:00
|
|
|
static inline int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
|
|
|
|
{
|
2014-09-24 19:38:44 +07:00
|
|
|
return -EINVAL;
|
2009-01-20 17:29:45 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int fcntl_getlease(struct file *filp)
|
|
|
|
{
|
2014-09-24 19:38:44 +07:00
|
|
|
return F_UNLCK;
|
2009-01-20 17:29:45 +07:00
|
|
|
}
|
|
|
|
|
2015-01-17 03:05:54 +07:00
|
|
|
static inline void
|
2016-01-08 03:08:51 +07:00
|
|
|
locks_free_lock_context(struct inode *inode)
|
2015-01-17 03:05:54 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2009-01-20 17:29:45 +07:00
|
|
|
static inline void locks_init_lock(struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-08-22 21:18:42 +07:00
|
|
|
static inline void locks_copy_conflock(struct file_lock *new, struct file_lock *fl)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void locks_copy_lock(struct file_lock *new, struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
|
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-02-04 00:13:08 +07:00
|
|
|
static inline void locks_remove_file(struct file *filp)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void posix_test_lock(struct file *filp, struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int posix_lock_file(struct file *filp, struct file_lock *fl,
|
|
|
|
struct file_lock *conflock)
|
|
|
|
{
|
|
|
|
return -ENOLCK;
|
|
|
|
}
|
|
|
|
|
2018-11-30 06:04:08 +07:00
|
|
|
static inline int locks_delete_block(struct file_lock *waiter)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int vfs_test_lock(struct file *filp, struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int vfs_lock_file(struct file *filp, unsigned int cmd,
|
|
|
|
struct file_lock *fl, struct file_lock *conf)
|
|
|
|
{
|
|
|
|
return -ENOLCK;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-10-23 00:38:13 +07:00
|
|
|
static inline int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl)
|
|
|
|
{
|
|
|
|
return -ENOLCK;
|
|
|
|
}
|
|
|
|
|
2012-03-06 01:18:59 +07:00
|
|
|
static inline int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-05-09 09:36:02 +07:00
|
|
|
static inline void lease_get_mtime(struct inode *inode,
|
|
|
|
struct timespec64 *time)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int generic_setlease(struct file *filp, long arg,
|
2014-08-22 21:40:25 +07:00
|
|
|
struct file_lock **flp, void **priv)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int vfs_setlease(struct file *filp, long arg,
|
2014-08-22 21:40:25 +07:00
|
|
|
struct file_lock **lease, void **priv)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2015-01-17 03:05:57 +07:00
|
|
|
static inline int lease_modify(struct file_lock *fl, int arg,
|
2014-09-01 18:12:07 +07:00
|
|
|
struct list_head *dispose)
|
2009-01-20 17:29:45 +07:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2015-04-17 02:49:38 +07:00
|
|
|
|
|
|
|
struct files_struct;
|
|
|
|
static inline void show_fd_locks(struct seq_file *f,
|
|
|
|
struct file *filp, struct files_struct *files) {}
|
2008-08-06 20:12:22 +07:00
|
|
|
#endif /* !CONFIG_FILE_LOCKING */
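The stubs above follow a common pattern: when a feature is compiled out, the header supplies static inline functions that return a fixed error or do nothing, so callers need no #ifdef of their own. A minimal userspace sketch of that pattern follows; DEMO_FILE_LOCKING and every demo_* name are hypothetical and only mirror the shape of the real declarations.

```c
#include <assert.h>
#include <errno.h>

struct demo_file;       /* opaque, hypothetical */
struct demo_file_lock;  /* opaque, hypothetical */

#ifdef DEMO_FILE_LOCKING
/* Feature enabled: the real implementations live in another file. */
extern int demo_posix_lock_file(struct demo_file *filp, struct demo_file_lock *fl);
extern int demo_vfs_test_lock(struct demo_file *filp, struct demo_file_lock *fl);
#else
/* Feature compiled out: callers still build, but locking always fails. */
static inline int demo_posix_lock_file(struct demo_file *filp,
				       struct demo_file_lock *fl)
{
	(void)filp;
	(void)fl;
	return -ENOLCK;
}

/* Testing for a lock trivially succeeds when locking does not exist. */
static inline int demo_vfs_test_lock(struct demo_file *filp,
				     struct demo_file_lock *fl)
{
	(void)filp;
	(void)fl;
	return 0;
}
#endif

/* A caller is written once and works with either configuration. */
static int demo_lock_or_report(struct demo_file *filp, struct demo_file_lock *fl)
{
	int err = demo_posix_lock_file(filp, fl);

	assert(err <= 0); /* convention: 0 on success, negative errno on failure */
	return err;
}
```

Because the stubs are static inline, the compiler can usually fold them away, so the disabled configuration costs nothing at run time.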
|
|
|
|
|
2015-07-11 17:43:03 +07:00
|
|
|
static inline struct inode *file_inode(const struct file *f)
|
|
|
|
{
|
|
|
|
return f->f_inode;
|
|
|
|
}
|
|
|
|
|
2016-03-27 03:14:37 +07:00
|
|
|
static inline struct dentry *file_dentry(const struct file *file)
|
|
|
|
{
|
2018-07-18 20:44:44 +07:00
|
|
|
return d_real(file->f_path.dentry, file_inode(file));
|
2016-03-27 03:14:37 +07:00
|
|
|
}
|
|
|
|
|
2015-10-23 00:38:13 +07:00
|
|
|
static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
|
|
|
|
{
|
2016-09-16 17:44:20 +07:00
|
|
|
return locks_lock_inode_wait(locks_inode(filp), fl);
|
2015-10-23 00:38:13 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct fasync_struct {
|
fasync: Fix deadlock between task-context and interrupt-context kill_fasync()
I observed the following deadlock between them:
[task 1]                         [task 2]                         [task 3]
kill_fasync()                    mm_update_next_owner()           copy_process()
spin_lock_irqsave(&fa->fa_lock)  read_lock(&tasklist_lock)        write_lock_irq(&tasklist_lock)
send_sigio()                     <IRQ>                            ...
read_lock(&fown->lock)           kill_fasync()                    ...
read_lock(&tasklist_lock)        spin_lock_irqsave(&fa->fa_lock)  ...
Task 1 can't acquire the read-locked tasklist_lock, since task 3 has
already expressed its wish to take the lock exclusively.
Task 2 holds the read-locked lock, but it can't take the spin lock.
There is also another possible deadlock (which I haven't observed):
[task 1]                          [task 2]
f_getown()                        kill_fasync()
read_lock(&f_own->lock)           spin_lock_irqsave(&fa->fa_lock,)
<IRQ>                             send_sigio()                      write_lock_irq(&f_own->lock)
kill_fasync()                     read_lock(&fown->lock)
spin_lock_irqsave(&fa->fa_lock,)
Actually, we do not need an exclusive fa->fa_lock in kill_fasync_rcu(),
as it only guarantees fa->fa_file->f_owner integrity. It may seem
that it used to give a task a small possibility to receive two sequential
signals, if there are two parallel kill_fasync() callers and the task
handles the first signal quickly, but the behaviour won't change,
since there is an exclusive sighand lock in do_send_sig_info().
The patch converts fa_lock into rwlock_t, and this fixes the two
deadlocks above, as an rwlock is allowed to be taken from an interrupt
handler by qrwlock design.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-04-05 18:58:06 +07:00
|
|
|
rwlock_t fa_lock;
|
2010-04-14 16:55:35 +07:00
|
|
|
int magic;
|
|
|
|
int fa_fd;
|
|
|
|
struct fasync_struct *fa_next; /* singly linked list */
|
|
|
|
struct file *fa_file;
|
|
|
|
struct rcu_head fa_rcu;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
#define FASYNC_MAGIC 0x4601
|
|
|
|
|
|
|
|
/* SMP safe fasync helpers: */
|
|
|
|
extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
|
2010-10-27 23:38:12 +07:00
|
|
|
extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasync_struct **, struct fasync_struct *);
|
|
|
|
extern int fasync_remove_entry(struct file *, struct fasync_struct **);
|
|
|
|
extern struct fasync_struct *fasync_alloc(void);
|
|
|
|
extern void fasync_free(struct fasync_struct *);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* can be called from interrupts */
|
|
|
|
extern void kill_fasync(struct fasync_struct **, int, int);
|
|
|
|
|
2014-08-22 22:27:32 +07:00
|
|
|
extern void __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
|
2017-06-13 18:35:50 +07:00
|
|
|
extern int f_setown(struct file *filp, unsigned long arg, int force);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void f_delown(struct file *filp);
|
2006-10-02 16:17:15 +07:00
|
|
|
extern pid_t f_getown(struct file *filp);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int send_sigurg(struct fown_struct *fown);
|
|
|
|
|
2017-07-17 14:45:35 +07:00
|
|
|
/*
|
|
|
|
* sb->s_flags. Note that these mirror the equivalent MS_* flags where
|
|
|
|
* represented in both.
|
|
|
|
*/
|
|
|
|
#define SB_RDONLY 1 /* Mount read-only */
|
|
|
|
#define SB_NOSUID 2 /* Ignore suid and sgid bits */
|
|
|
|
#define SB_NODEV 4 /* Disallow access to device special files */
|
|
|
|
#define SB_NOEXEC 8 /* Disallow program execution */
|
|
|
|
#define SB_SYNCHRONOUS 16 /* Writes are synced at once */
|
|
|
|
#define SB_MANDLOCK 64 /* Allow mandatory locks on an FS */
|
|
|
|
#define SB_DIRSYNC 128 /* Directory modifications are synchronous */
|
|
|
|
#define SB_NOATIME 1024 /* Do not update access times. */
|
|
|
|
#define SB_NODIRATIME 2048 /* Do not update directory access times */
|
|
|
|
#define SB_SILENT 32768
|
|
|
|
#define SB_POSIXACL (1<<16) /* VFS does not apply the umask */
|
2020-07-02 08:56:04 +07:00
|
|
|
#define SB_INLINECRYPT (1<<17) /* Use blk-crypto for encrypted files */
|
2017-07-17 14:45:35 +07:00
|
|
|
#define SB_KERNMOUNT (1<<22) /* this is a kern_mount call */
|
|
|
|
#define SB_I_VERSION (1<<23) /* Update inode I_version field */
|
|
|
|
#define SB_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
|
|
|
|
|
|
|
|
/* These sb flags are internal to the kernel */
|
|
|
|
#define SB_SUBMOUNT (1<<26)
|
2018-11-04 21:28:36 +07:00
|
|
|
#define SB_FORCE (1<<27)
|
2017-07-17 14:45:35 +07:00
|
|
|
#define SB_NOSEC (1<<28)
|
|
|
|
#define SB_BORN (1<<29)
|
|
|
|
#define SB_ACTIVE (1<<30)
|
|
|
|
#define SB_NOUSER (1<<31)
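The SB_* values above are single-bit masks that are combined in one flags word. The sketch below shows how such flags are typically tested, set and cleared. The DEMO_ constants copy a few values from the listing, and struct demo_super_block with its helpers is a hypothetical cut-down stand-in, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the listing; the DEMO_ prefix marks them as local copies. */
#define DEMO_SB_RDONLY   1UL
#define DEMO_SB_NOATIME  1024UL
#define DEMO_SB_LAZYTIME (1UL << 25)

/* Hypothetical cut-down super_block holding only the flags word. */
struct demo_super_block {
	unsigned long s_flags;
};

/* Test a single flag. */
static bool demo_sb_rdonly(const struct demo_super_block *sb)
{
	return (sb->s_flags & DEMO_SB_RDONLY) != 0;
}

/* Set or clear one flag without disturbing the others. */
static void demo_sb_set_flag(struct demo_super_block *sb, unsigned long flag, bool on)
{
	assert(flag != 0 && (flag & (flag - 1)) == 0); /* exactly one bit set */
	if (on)
		sb->s_flags |= flag;
	else
		sb->s_flags &= ~flag;
}
```

The sketch uses unsigned long constants so that shifts into the top bit stay well defined; that choice belongs to this example only.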
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Umount options
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define MNT_FORCE 0x00000001 /* Attempt to forcibly umount */
|
|
|
|
#define MNT_DETACH 0x00000002 /* Just detach from the tree */
|
|
|
|
#define MNT_EXPIRE 0x00000004 /* Mark for expiry */
|
2010-02-10 18:15:53 +07:00
|
|
|
#define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */
|
|
|
|
#define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */
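The umount flags above are meant to be validated as a set. The sketch below checks two rules described in the umount2(2) manual page: unknown bits are rejected, and MNT_EXPIRE may not be combined with MNT_FORCE or MNT_DETACH. The function name and the DEMO_ constants are hypothetical, and this is not the kernel's own validation code.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the listing; the DEMO_ prefix marks them as local copies. */
#define DEMO_MNT_FORCE       0x00000001u
#define DEMO_MNT_DETACH      0x00000002u
#define DEMO_MNT_EXPIRE      0x00000004u
#define DEMO_UMOUNT_NOFOLLOW 0x00000008u
#define DEMO_UMOUNT_UNUSED   0x80000000u

#define DEMO_UMOUNT_KNOWN (DEMO_MNT_FORCE | DEMO_MNT_DETACH | \
			   DEMO_MNT_EXPIRE | DEMO_UMOUNT_NOFOLLOW)

/* Returns 0 for an acceptable flag combination, -EINVAL otherwise. */
static int demo_check_umount_flags(unsigned int flags)
{
	/* The "unused" bit must never be part of the accepted set. */
	assert((DEMO_UMOUNT_KNOWN & DEMO_UMOUNT_UNUSED) == 0);

	if (flags & ~DEMO_UMOUNT_KNOWN)
		return -EINVAL;
	if ((flags & DEMO_MNT_EXPIRE) &&
	    (flags & (DEMO_MNT_FORCE | DEMO_MNT_DETACH)))
		return -EINVAL;
	return 0;
}
```

A bit that is guaranteed to be unused gives callers a safe way to probe whether flag checking is in effect: passing it should always fail.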
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-06-17 05:48:31 +07:00
|
|
|
/* sb->s_iflags */
|
|
|
|
#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */
|
2015-06-30 02:42:03 +07:00
|
|
|
#define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */
|
2016-06-10 03:34:02 +07:00
|
|
|
#define SB_I_NODEV 0x00000004 /* Ignore devices on this fs */
|
2018-03-15 06:20:29 +07:00
|
|
|
#define SB_I_MULTIROOT 0x00000008 /* Multiple roots to the dentry tree */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-06-10 04:06:06 +07:00
|
|
|
/* sb->s_iflags to limit user namespace mounts */
|
|
|
|
#define SB_I_USERNS_VISIBLE 0x00000010 /* fstype already mounted */
|
2018-02-21 23:33:37 +07:00
|
|
|
#define SB_I_IMA_UNVERIFIABLE_SIGNATURE 0x00000020
|
|
|
|
#define SB_I_UNTRUSTED_MOUNTER 0x00000040
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2020-04-09 15:29:47 +07:00
|
|
|
#define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */
|
|
|
|
|
2012-06-12 21:20:34 +07:00
|
|
|
/* Possible states of 'frozen' field */
|
|
|
|
enum {
|
|
|
|
SB_UNFROZEN = 0, /* FS is unfrozen */
|
|
|
|
SB_FREEZE_WRITE = 1, /* Writes, dir ops, ioctls frozen */
|
|
|
|
SB_FREEZE_PAGEFAULT = 2, /* Page faults stopped as well */
|
|
|
|
SB_FREEZE_FS = 3, /* For internal FS use (e.g. to stop
|
|
|
|
* internal threads if needed) */
|
|
|
|
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */
|
|
|
|
};
|
|
|
|
|
|
|
|
#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
|
|
|
|
|
|
|
|
struct sb_writers {
|
2015-08-11 22:05:04 +07:00
|
|
|
int frozen; /* Is sb frozen? */
|
|
|
|
wait_queue_head_t wait_unfrozen; /* for get_super_thawed() */
|
|
|
|
struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
|
2012-06-12 21:20:34 +07:00
|
|
|
};
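Since SB_FREEZE_COMPLETE is 4, SB_FREEZE_LEVELS evaluates to 3, so rw_sem has one slot for each of the WRITE, PAGEFAULT and FS levels, while SB_UNFROZEN and SB_FREEZE_COMPLETE are states without a semaphore of their own. The sketch below works through that arithmetic. The level-minus-one index mapping is an assumption about how such an array would be addressed, and all DEMO_ names are hypothetical copies.

```c
#include <assert.h>

/* Local copies of the enum values from the listing. */
enum {
	DEMO_SB_UNFROZEN = 0,
	DEMO_SB_FREEZE_WRITE = 1,
	DEMO_SB_FREEZE_PAGEFAULT = 2,
	DEMO_SB_FREEZE_FS = 3,
	DEMO_SB_FREEZE_COMPLETE = 4,
};

#define DEMO_SB_FREEZE_LEVELS (DEMO_SB_FREEZE_COMPLETE - 1)

/* Map a freeze level to its slot in an array of DEMO_SB_FREEZE_LEVELS entries. */
static int demo_freeze_level_to_index(int level)
{
	/* Only WRITE, PAGEFAULT and FS have a slot. */
	assert(level >= DEMO_SB_FREEZE_WRITE && level <= DEMO_SB_FREEZE_LEVELS);
	return level - 1;
}
```

Starting the lockable levels at 1 keeps 0 free to mean "not frozen", at the cost of the off-by-one adjustment when indexing.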
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct super_block {
|
|
|
|
struct list_head s_list; /* Keep this first */
|
|
|
|
dev_t s_dev; /* search index; _not_ kdev_t */
|
2010-01-26 21:12:43 +07:00
|
|
|
unsigned char s_blocksize_bits;
|
|
|
|
unsigned long s_blocksize;
|
2009-09-19 03:05:53 +07:00
|
|
|
loff_t s_maxbytes; /* Max file size */
|
2005-04-17 05:20:36 +07:00
|
|
|
struct file_system_type *s_type;
|
2007-02-12 15:55:41 +07:00
|
|
|
const struct super_operations *s_op;
|
2009-09-22 07:01:08 +07:00
|
|
|
const struct dquot_operations *dq_op;
|
2009-09-22 07:01:09 +07:00
|
|
|
const struct quotactl_ops *s_qcop;
|
2007-10-22 06:42:17 +07:00
|
|
|
const struct export_operations *s_export_op;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long s_flags;
|
2015-06-17 05:48:31 +07:00
|
|
|
unsigned long s_iflags; /* internal SB_I_* flags */
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long s_magic;
|
|
|
|
struct dentry *s_root;
|
|
|
|
struct rw_semaphore s_umount;
|
|
|
|
int s_count;
|
|
|
|
atomic_t s_active;
|
[PATCH] fs.h: ifdef security fields
[assuming BSD security levels are deleted]
The only user of i_security, f_security, s_security fields is SELinux,
however, quite a few security modules are trying to get into kernel.
So, wrap them under CONFIG_SECURITY. Adding config option for each
security field is likely an overkill.
Following Stephen Smalley's suggestion, i_security initialization is
moved to security_inode_alloc() to not clutter core code with ifdefs
and make alloc_inode() codepath tiny little bit smaller and faster.
The user of (highly greppable) struct fown_struct::security field is
still to be found. I've checked every "fown_struct" and every "f_owner"
occurrence. Additionally its removal doesn't break i386 allmodconfig
build.
struct inode, struct file, struct super_block, struct fown_struct
become smaller.
P.S. Combined with two reiserfs inode shrinking patches sent to
linux-fsdevel, I can finally suck 12 reiserfs inodes into one page.
/proc/slabinfo
-ext2_inode_cache 388 10
+ext2_inode_cache 384 10
-inode_cache 280 14
+inode_cache 276 14
-proc_inode_cache 296 13
+proc_inode_cache 292 13
-reiser_inode_cache 336 11
+reiser_inode_cache 332 12 <=
-shmem_inode_cache 372 10
+shmem_inode_cache 368 10
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
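The size reduction reported above comes from dropping one pointer-sized field per object when the option is disabled. A single translation unit cannot be compiled both ways, so this sketch models the two configurations as two hypothetical structs and measures the difference. The struct and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Layout with the security pointer present (option enabled). */
struct demo_obj_with_security {
	void *private_data;
	void *security;
};

/* Layout with the field compiled out (option disabled). */
struct demo_obj_without_security {
	void *private_data;
};

/* Bytes saved per object when the field is compiled out. */
static size_t demo_bytes_saved(void)
{
	assert(sizeof(struct demo_obj_with_security) >
	       sizeof(struct demo_obj_without_security));
	return sizeof(struct demo_obj_with_security) -
	       sizeof(struct demo_obj_without_security);
}
```

On a 32-bit build a pointer is 4 bytes, which matches the 4-byte drop per cache entry in the slabinfo figures quoted in the commit message.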
2006-09-29 16:00:01 +07:00
|
|
|
#ifdef CONFIG_SECURITY
|
2005-04-17 05:20:36 +07:00
|
|
|
void *s_security;
|
2006-09-29 16:00:01 +07:00
|
|
|
#endif
|
2010-05-14 07:53:14 +07:00
|
|
|
const struct xattr_handler **s_xattr;
|
2018-12-12 16:50:12 +07:00
|
|
|
#ifdef CONFIG_FS_ENCRYPTION
|
2015-05-16 06:26:10 +07:00
|
|
|
const struct fscrypt_operations *s_cop;
|
fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".
Why we need this
~~~~~~~~~~~~~~~~
The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.
Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions. But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic. This is because it depends on
whether the files are currently present in the inode cache.
Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work. Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons. First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.
Second, almost all users of fscrypt actually do need the keys to be
global. The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process. This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.
On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2]. This is ugly and raises security concerns. Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).
By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.
Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:
- Supporting key removal with the semantics such that the secret is
removed immediately and any unused inodes using the key are evicted;
also, the eviction of any in-use inodes can be retried.
- Calculating a key-dependent cryptographic identifier and returning it
to userspace.
- Allowing keys to be added and removed by non-root users, but only keys
for v2 encryption policies; and to prevent denial-of-service attacks,
users can only remove keys they themselves have added, and a key is
only really removed after all users who added it have removed it.
Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.
However, to reuse code the implementation still uses the keyrings
service internally. Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.
References:
[1] https://github.com/google/fscrypt
[2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
[3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
[4] https://github.com/google/fscrypt/issues/116
[5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
[6] https://github.com/google/fscrypt/issues/128
[7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-05 09:35:46 +07:00
|
|
|
struct key *s_master_keys; /* master crypto keys in use */
|
2019-07-22 23:26:21 +07:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_FS_VERITY
|
|
|
|
const struct fsverity_operations *s_vop;
|
2018-05-01 05:51:35 +07:00
|
|
|
#endif
|
VFS: don't keep disconnected dentries on d_anon
The original purpose of the per-superblock d_anon list was to
keep disconnected dentries in the cache between consecutive
requests to the NFS server. Dentries can be disconnected if
a client holds a file open and repeatedly performs IO on it,
and if the server drops the dentry, whether due to memory
pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".
This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked
file indefinitely delayed") which caused disconnected dentries
to be freed as soon as their refcount reached zero.
This means that, when a dentry being used by nfsd gets disconnected, a
new one needs to be allocated for every request (unless requests
overlap). As the dentry has no name, no parent, and no children,
there is little of value to cache. As small memory allocations are
typically fast (from per-cpu free lists) this likely has little cost.
This means that the original purpose of s_anon is no longer relevant:
there is no longer any need to keep disconnected dentries on a list so
they appear to be hashed.
However, s_anon now has a new use. When you mount an NFS filesystem,
the dentry stored in s_root is just a placebo. The "real" root dentry
is allocated using d_obtain_root() and so it kept on the s_anon list.
I don't know the reason for this, but suspect it related to NFSv4
where a mount of "server:/some/path" require NFS to look up the root
filehandle on the server, then walk down "/some" and "/path" to get
the filehandle to mount.
Whatever the reason, NFS depends on the s_anon list and on
shrink_dcache_for_umount() pruning all dentries on this list. So we
cannot simply remove s_anon.
We could just leave the code unchanged, but apart from that being
potentially confusing, the (unfair) bit-spin-lock which protects
s_anon can become a bottleneck when lots of disconnected dentries are
being created.
So this patch renames s_anon to s_roots, and stops storing
disconnected dentries on the list. Only dentries obtained with
d_obtain_root() are now stored on this list. There are many fewer of
these (only NFS and NILFS2 use the call, and only during filesystem
mount) so contention on the bit-lock will not be a problem.
Possibly an alternate solution should be found for NFS and NILFS2, but
that would require understanding their needs first.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-21 05:45:40 +07:00
|
|
|
struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
|
2011-11-21 18:11:30 +07:00
|
|
|
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
|
2005-04-17 05:20:36 +07:00
|
|
|
struct block_device *s_bdev;
|
2009-09-16 20:02:33 +07:00
|
|
|
struct backing_dev_info *s_bdi;
|
2007-05-11 12:51:50 +07:00
|
|
|
struct mtd_info *s_mtd;
|
2011-12-13 10:53:00 +07:00
|
|
|
struct hlist_node s_instances;
|
2014-09-30 15:43:09 +07:00
|
|
|
unsigned int s_quota_types; /* Bitmask of supported quota types */
|
2005-04-17 05:20:36 +07:00
|
|
|
struct quota_info s_dquot; /* Diskquota specific options */
|
|
|
|
|
2012-06-12 21:20:34 +07:00
|
|
|
struct sb_writers s_writers;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
fs: group frequently accessed fields of struct super_block together
Kernel test robot reported [1] a 6% performance regression in a
concurrent unlink(2) workload on commit 60f7ed8c7c4d ("fsnotify: send
path type events to group with super block marks").
The performance test was run with no fsnotify marks at all on the
data set, so the only extra instructions added by the offending
commit are tests of the super_block fields s_fsnotify_{marks,mask}
and these tests happen on almost every single inode access.
When adding those fields to the super_block struct, we did not give much
thought to placing them on hot cache lines (we just placed them at the
end of the struct).
Re-organize struct super_block to try and keep some frequently accessed
fields on the same cache line.
Move the frequently accessed fields s_fsnotify_{marks,mask} near the
frequently accessed fields s_fs_info,s_time_gran, while filling a 64bit
alignment hole after s_time_gran.
Move the seldom accessed fields s_id,s_uuid,s_max_links,s_mode near the
seldom accessed fields s_vfs_rename_mutex,s_subtype.
Rong Chen confirmed that this patch solved the reported problem.
[1] https://lkml.org/lkml/2018/9/30/206
Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 1e6cb72399 ("fsnotify: add super block object type")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
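The "64bit alignment hole" mentioned in the commit message above can be illustrated with a small userspace sketch. The two structs below are hypothetical stand-ins, not the real struct super_block; they only assume that a 32-bit field followed by a pointer leaves padding on typical 64-bit ABIs, and that moving a second 32-bit field next to the first one fills that padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: each 32-bit field is followed by a pointer or the
 * end of the struct, so a 64-bit ABI typically pads after both of them. */
struct toy_sb_sparse {
	void *fs_info;
	uint32_t time_gran;	/* likely followed by a 4-byte hole */
	void *marks;
	uint32_t mask;		/* likely followed by tail padding */
};

/* Same fields, but the two 32-bit members sit next to each other, so the
 * second one occupies what would otherwise be the hole. */
struct toy_sb_packed {
	void *fs_info;
	uint32_t time_gran;
	uint32_t mask;
	void *marks;
};

static size_t toy_sparse_size(void)
{
	return sizeof(struct toy_sb_sparse);
}

static size_t toy_packed_size(void)
{
	return sizeof(struct toy_sb_packed);
}
```

The exact sizes depend on the ABI, so the sketch only relies on the packed layout never being larger than the sparse one, and on the two 32-bit fields being adjacent.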
2018-10-18 18:22:55 +07:00
|
|
|
/*
|
|
|
|
* Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
|
|
|
|
* s_fsnotify_marks together for cache efficiency. They are frequently
|
|
|
|
* accessed and rarely modified.
|
|
|
|
*/
|
|
|
|
void *s_fs_info; /* Filesystem private info */
|
|
|
|
|
|
|
|
/* Granularity of c/m/atime in ns (cannot be worse than a second) */
|
|
|
|
u32 s_time_gran;
|
2018-01-22 09:04:23 +07:00
|
|
|
/* Time limits for c/m/atime in seconds */
|
|
|
|
time64_t s_time_min;
|
|
|
|
time64_t s_time_max;
|
2018-10-18 18:22:55 +07:00
|
|
|
#ifdef CONFIG_FSNOTIFY
|
|
|
|
__u32 s_fsnotify_mask;
|
|
|
|
struct fsnotify_mark_connector __rcu *s_fsnotify_marks;
|
|
|
|
#endif
|
|
|
|
|
2017-05-10 20:06:33 +07:00
|
|
|
char s_id[32]; /* Informational name */
|
|
|
|
uuid_t s_uuid; /* UUID */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-02-07 00:45:27 +07:00
|
|
|
unsigned int s_max_links;
|
2008-02-23 07:50:45 +07:00
|
|
|
fmode_t s_mode;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The next field is for VFS *only*. No filesystems have any business
|
|
|
|
* even looking at it. You had been warned.
|
|
|
|
*/
|
2006-03-23 18:00:33 +07:00
|
|
|
struct mutex s_vfs_rename_mutex; /* Kludge */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
add filesystem subtype support
There's a slight problem with filesystem type representation in fuse
based filesystems.
From the kernel's view, there are just two filesystem types: fuse and
fuseblk. From the user's view there are lots of different filesystem
types. The user is not even much concerned if the filesystem is fuse based
or not. So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.
The current scheme is to encode the real filesystem type in the mount
source. So an sshfs mount looks like this:
sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...
This url-ish syntax works OK for sshfs and similar filesystems. However
for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
the kernel expects the mount source to be a real device name.
A possibly better scheme would be to encode the real type in the type
field as "type.subtype". So fuse mounts would look like this:
/dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...
This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
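The "type.subtype" scheme described in the commit message above can be sketched as a small formatting helper. format_fs_type() is a hypothetical name used only for illustration; it is not the kernel function that fills in /proc/mounts.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "type.subtype" into buf when a non-empty
 * subtype is present, otherwise just "type". Returns what snprintf()
 * returns, i.e. the length the full string needs.
 */
static int format_fs_type(char *buf, size_t len, const char *type,
			  const char *subtype)
{
	if (subtype && subtype[0] != '\0')
		return snprintf(buf, len, "%s.%s", type, subtype);
	return snprintf(buf, len, "%s", type);
}
```

With type "fuse" and subtype "sshfs" this yields the "fuse.sshfs" form shown in the example mounts above; an empty or NULL subtype leaves the type unchanged.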
2007-05-08 14:25:43 +07:00
|
|
|
/*
|
|
|
|
* Filesystem subtype. If non-empty the filesystem type field
|
|
|
|
* in /proc/mounts will be "type.subtype"
|
|
|
|
*/
|
2018-11-04 19:18:51 +07:00
|
|
|
const char *s_subtype;
|
2008-02-08 19:21:35 +07:00
|
|
|
|
2010-12-18 22:22:30 +07:00
|
|
|
const struct dentry_operations *s_d_op; /* default d_op for dentries */
|
2011-05-26 23:01:19 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Saved pool identifier for cleancache (-1 means none)
|
|
|
|
*/
|
|
|
|
int cleancache_poolid;
|
2011-07-08 11:14:42 +07:00
|
|
|
|
|
|
|
struct shrinker s_shrink; /* per-sb shrinker handle */
|
2011-11-21 18:11:31 +07:00
|
|
|
|
2011-11-21 18:11:32 +07:00
|
|
|
/* Number of inodes with nlink == 0 but still referenced */
|
|
|
|
atomic_long_t s_remove_count;
|
|
|
|
|
2018-10-17 18:07:05 +07:00
|
|
|
/* Pending fsnotify inode refs */
|
|
|
|
atomic_long_t s_fsnotify_inode_refs;
|
|
|
|
|
2011-11-21 18:11:31 +07:00
|
|
|
/* Being remounted read-only */
|
|
|
|
int s_readonly_remount;
|
2013-09-04 20:04:39 +07:00
|
|
|
|
vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
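The cursor idea from the commit message above (a per-superblock error record plus a per-file position in it) can be sketched with a deliberately simplified userspace model. All toy_* names are hypothetical, and the plain counter used here is an assumption for illustration; the kernel's errseq_t packs the error and sequence differently.

```c
#include <errno.h>

/* Simplified stand-in for the per-superblock writeback error record. */
struct toy_wb_sb {
	unsigned int wb_seq;	/* bumped on every recorded error */
	int wb_errno;		/* most recent error, as a negative errno */
};

/* Simplified stand-in for an open file with its own cursor. */
struct toy_wb_file {
	struct toy_wb_sb *sb;
	unsigned int seq_seen;	/* value of wb_seq last observed */
};

/* Sample the current sequence at open time. */
static void toy_wb_open(struct toy_wb_file *f, struct toy_wb_sb *sb)
{
	f->sb = sb;
	f->seq_seen = sb->wb_seq;
}

/* Record a writeback failure on the superblock. */
static void toy_wb_set_error(struct toy_wb_sb *sb, int err)
{
	sb->wb_errno = err;
	sb->wb_seq++;
}

/* Report an error once per file if one occurred since the last check. */
static int toy_wb_syncfs(struct toy_wb_file *f)
{
	if (f->seq_seen == f->sb->wb_seq)
		return 0;
	f->seq_seen = f->sb->wb_seq;
	return f->sb->wb_errno;
}
```

In this model each open file sees a given error exactly once, which is the behaviour the message asks of syncfs; the model ignores concurrency entirely.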
2020-06-02 11:45:36 +07:00
|
|
|
/* per-sb errseq_t for reporting writeback errors via syncfs */
|
|
|
|
errseq_t s_wb_err;
|
|
|
|
|
2013-09-04 20:04:39 +07:00
|
|
|
/* AIO completions deferred from interrupt context */
|
|
|
|
struct workqueue_struct *s_dio_done_wq;
|
2014-08-07 17:23:41 +07:00
|
|
|
struct hlist_head s_pins;
|
2013-08-28 07:18:00 +07:00
|
|
|
|
2016-05-24 21:29:01 +07:00
|
|
|
/*
|
|
|
|
* Owning user namespace and default context in which to
|
|
|
|
* interpret filesystem uids, gids, quotas, device nodes,
|
|
|
|
* xattrs and security labels.
|
|
|
|
*/
|
|
|
|
struct user_namespace *s_user_ns;
|
|
|
|
|
2013-08-28 07:18:00 +07:00
|
|
|
/*
|
2019-01-31 01:52:37 +07:00
|
|
|
* The list_lru structure is essentially just a pointer to a table
|
|
|
|
* of per-node lru lists, each of which has its own spinlock.
|
|
|
|
* There is no need to put them into separate cachelines.
|
2013-08-28 07:18:00 +07:00
|
|
|
*/
|
2019-01-31 01:52:37 +07:00
|
|
|
struct list_lru s_dentry_lru;
|
|
|
|
struct list_lru s_inode_lru;
|
2013-10-05 04:06:56 +07:00
|
|
|
struct rcu_head rcu;
|
2015-07-23 01:21:13 +07:00
|
|
|
struct work_struct destroy_work;
|
2014-10-24 05:14:39 +07:00
|
|
|
|
2015-03-05 01:40:00 +07:00
|
|
|
struct mutex s_sync_lock; /* sync serialisation lock */
|
2014-10-24 05:14:39 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Indicates how deep in a filesystem stack this SB is
|
|
|
|
*/
|
|
|
|
int s_stack_depth;
|
2015-03-05 00:37:22 +07:00
|
|
|
|
|
|
|
/* s_inode_list_lock protects s_inodes */
|
|
|
|
spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
|
|
|
|
struct list_head s_inodes; /* all inodes */
|
2016-07-27 05:21:50 +07:00
|
|
|
|
|
|
|
spinlock_t s_inode_wblist_lock;
|
|
|
|
struct list_head s_inodes_wb; /* writeback inodes */
|
2016-10-28 15:22:25 +07:00
|
|
|
} __randomize_layout;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-12-11 23:15:45 +07:00
|
|
|
/* Helper functions so that in most cases filesystems will
 * not need to deal directly with kuid_t and kgid_t and can
 * instead deal with the raw numeric values that are stored
 * in the filesystem.
 */
static inline uid_t i_uid_read(const struct inode *inode)
{
	return from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
}

static inline gid_t i_gid_read(const struct inode *inode)
{
	return from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
}

static inline void i_uid_write(struct inode *inode, uid_t uid)
{
	inode->i_uid = make_kuid(inode->i_sb->s_user_ns, uid);
}

static inline void i_gid_write(struct inode *inode, gid_t gid)
{
	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
}
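The helpers above translate between the raw ids stored by the filesystem and kernel-internal ids by way of the superblock's user namespace. A minimal userspace sketch of that translation follows, assuming a namespace with a single mapping extent. The toy_* names are hypothetical and the model ignores multiple extents and nested namespaces.

```c
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* One mapping extent: ids [first, first + count) inside the namespace
 * correspond to [lower_first, lower_first + count) outside it. */
struct toy_userns {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace-local id -> kernel-side id (the role make_kuid() plays). */
static uint32_t toy_make_kid(const struct toy_userns *ns, uint32_t id)
{
	if (id >= ns->first && id - ns->first < ns->count)
		return ns->lower_first + (id - ns->first);
	return TOY_INVALID_ID;
}

/* Kernel-side id -> namespace-local id (the role from_kuid() plays). */
static uint32_t toy_from_kid(const struct toy_userns *ns, uint32_t kid)
{
	if (kid >= ns->lower_first && kid - ns->lower_first < ns->count)
		return ns->first + (kid - ns->lower_first);
	return TOY_INVALID_ID;
}
```

With an extent of { 0, 100000, 65536 }, local id 1000 maps to 101000 and back again, while ids outside the extent map to the invalid value in either direction.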
|
|
|
|
|
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script could be a little shorter if different cases were combined,
but this version was sufficient for my use case.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime1, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime1, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
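The y2038 limit that motivates the commit above comes from storing seconds in a signed 32-bit field. The sketch below is a userspace illustration with a hypothetical helper name; it only checks whether a 64-bit seconds value would still fit in such a field.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a seconds-since-epoch value can be stored
 * in a signed 32-bit field without loss. The largest such value,
 * 2147483647, corresponds to 2038-01-19 03:14:07 UTC.
 */
static bool fits_in_time32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}
```

One second past that maximum no longer fits, which is why the inode timestamps move to a type with a 64-bit seconds field.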
2018-05-09 09:36:02 +07:00
|
|
|
extern struct timespec64 current_time(struct inode *inode);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Snapshotting support.
|
|
|
|
*/
|
|
|
|
|
2012-06-12 21:20:34 +07:00
|
|
|
void __sb_end_write(struct super_block *sb, int level);
|
|
|
|
int __sb_start_write(struct super_block *sb, int level, bool wait);
|
|
|
|
|
2015-07-20 04:48:20 +07:00
|
|
|
#define __sb_writers_acquired(sb, lev) \
|
2015-08-11 22:05:04 +07:00
|
|
|
percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
|
2015-07-20 04:48:20 +07:00
|
|
|
#define __sb_writers_release(sb, lev) \
|
2015-08-11 22:05:04 +07:00
|
|
|
percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
|
2015-07-20 04:48:20 +07:00
|
|
|
|
2012-06-12 21:20:34 +07:00
|
|
|
/**
 * sb_end_write - drop write access to a superblock
 * @sb: the super we wrote to
 *
 * Decrement number of writers to the filesystem. Wake up possible waiters
 * wanting to freeze the filesystem.
 */
static inline void sb_end_write(struct super_block *sb)
{
	__sb_end_write(sb, SB_FREEZE_WRITE);
}

/**
 * sb_end_pagefault - drop write access to a superblock from a page fault
 * @sb: the super we wrote to
 *
 * Decrement number of processes handling write page fault to the filesystem.
 * Wake up possible waiters wanting to freeze the filesystem.
 */
static inline void sb_end_pagefault(struct super_block *sb)
{
	__sb_end_write(sb, SB_FREEZE_PAGEFAULT);
}

/**
 * sb_end_intwrite - drop write access to a superblock for internal fs purposes
 * @sb: the super we wrote to
 *
 * Decrement fs-internal number of writers to the filesystem. Wake up possible
 * waiters wanting to freeze the filesystem.
 */
static inline void sb_end_intwrite(struct super_block *sb)
{
	__sb_end_write(sb, SB_FREEZE_FS);
}
|
|
|
|
|
|
|
|
/**
 * sb_start_write - get write access to a superblock
 * @sb: the super we write to
 *
 * When a process wants to write data or metadata to a file system (i.e. dirty
 * a page or an inode), it should embed the operation in a sb_start_write() -
 * sb_end_write() pair to get exclusion against file system freezing. This
 * function increments number of writers preventing freezing. If the file
 * system is already frozen, the function waits until the file system is
 * thawed.
 *
 * Since freeze protection behaves as a lock, users have to preserve
 * ordering of freeze protection and other filesystem locks. Generally,
 * freeze protection should be the outermost lock. In particular, we have:
 *
 * sb_start_write
 *   -> i_mutex			(write path, truncate, directory ops, ...)
 *   -> s_umount		(freeze_super, thaw_super)
 */
static inline void sb_start_write(struct super_block *sb)
{
	__sb_start_write(sb, SB_FREEZE_WRITE, true);
}

static inline int sb_start_write_trylock(struct super_block *sb)
{
	return __sb_start_write(sb, SB_FREEZE_WRITE, false);
}
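The pairing rule described above (every start must be matched by an end, and a freeze has to wait for writers to drain) can be sketched with a single-threaded userspace model. The toy_* names are hypothetical. The real helpers use per-level percpu rw-semaphores and block instead of failing, so this is only a model of the counting, not of the locking.

```c
#include <stdbool.h>

/* Levels mirror the order WRITE < PAGEFAULT < FS used by the helpers. */
enum {
	TOY_FREEZE_WRITE = 1,
	TOY_FREEZE_PAGEFAULT = 2,
	TOY_FREEZE_FS = 3,
	TOY_FREEZE_LEVELS = 3,
};

struct toy_freeze_sb {
	int writers[TOY_FREEZE_LEVELS];	/* active writers per level */
	int frozen_level;		/* 0 means not frozen */
};

/* Like a trylock variant: refuse instead of waiting when frozen. */
static bool toy_start_write_trylock(struct toy_freeze_sb *sb, int level)
{
	if (sb->frozen_level >= level)
		return false;
	sb->writers[level - 1]++;
	return true;
}

static void toy_end_write(struct toy_freeze_sb *sb, int level)
{
	sb->writers[level - 1]--;
}

/* A freeze at this level can only proceed once its writers are gone. */
static bool toy_try_freeze(struct toy_freeze_sb *sb, int level)
{
	if (sb->writers[level - 1] > 0)
		return false;
	sb->frozen_level = level;
	return true;
}
```

In the model, a freeze attempt fails while a writer holds the level, succeeds once the writer has called toy_end_write(), and from then on rejects new writers at that level while higher levels remain available.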
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sb_start_pagefault - get write access to a superblock from a page fault
|
|
|
|
* @sb: the super we write to
|
|
|
|
*
|
|
|
|
* When a process starts handling write page fault, it should embed the
|
|
|
|
* operation into sb_start_pagefault() - sb_end_pagefault() pair to get
|
|
|
|
* exclusion against file system freezing. This is needed since the page fault
|
|
|
|
* is going to dirty a page. This function increments number of running page
|
|
|
|
* faults preventing freezing. If the file system is already frozen, the
|
|
|
|
* function waits until the file system is thawed.
|
|
|
|
*
|
|
|
|
* Since page fault freeze protection behaves as a lock, users have to preserve
|
|
|
|
* ordering of freeze protection and other filesystem locks. It is advised to
|
2020-06-09 11:33:54 +07:00
|
|
|
* put sb_start_pagefault() close to mmap_lock in lock ordering. Page fault
|
2012-06-12 21:20:34 +07:00
|
|
|
* handling code implies lock dependency:
|
|
|
|
*
|
2020-06-09 11:33:54 +07:00
|
|
|
* mmap_lock
|
2012-06-12 21:20:34 +07:00
|
|
|
* -> sb_start_pagefault
|
|
|
|
*/
|
|
|
|
static inline void sb_start_pagefault(struct super_block *sb)
|
|
|
|
{
|
|
|
|
__sb_start_write(sb, SB_FREEZE_PAGEFAULT, true);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sb_start_intwrite - get write access to a superblock for internal fs purposes
|
|
|
|
* @sb: the super we write to
|
|
|
|
*
|
|
|
|
* This is the third level of protection against filesystem freezing. It is
|
|
|
|
* free for use by a filesystem. The only requirement is that it must rank
|
|
|
|
* below sb_start_pagefault.
|
|
|
|
*
|
|
|
|
* For example filesystem can call sb_start_intwrite() when starting a
|
|
|
|
* transaction which somewhat eases handling of freezing for internal sources
|
|
|
|
* of filesystem changes (internal fs threads, discarding preallocation on file
|
|
|
|
* close, etc.).
|
|
|
|
*/
|
|
|
|
static inline void sb_start_intwrite(struct super_block *sb)
|
|
|
|
{
|
|
|
|
__sb_start_write(sb, SB_FREEZE_FS, true);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2018-05-14 09:40:30 +07:00
|
|
|
static inline int sb_start_intwrite_trylock(struct super_block *sb)
|
|
|
|
{
|
|
|
|
return __sb_start_write(sb, SB_FREEZE_FS, false);
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-03-24 06:43:26 +07:00
|
|
|
extern bool inode_owner_or_capable(const struct inode *inode);
|
2007-07-17 16:30:08 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* VFS helper functions..
|
|
|
|
*/
|
2012-06-11 05:09:36 +07:00
|
|
|
extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
|
2011-07-26 12:41:39 +07:00
|
|
|
extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
|
2011-07-26 12:52:52 +07:00
|
|
|
extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
|
2008-06-24 21:50:16 +07:00
|
|
|
extern int vfs_symlink(struct inode *, struct dentry *, const char *);
|
2011-09-21 04:14:31 +07:00
|
|
|
extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int vfs_rmdir(struct inode *, struct dentry *);
|
2011-09-20 20:14:34 +07:00
|
|
|
extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
|
2014-04-01 22:08:42 +07:00
|
|
|
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
|
2020-05-14 21:44:23 +07:00
|
|
|
|
|
|
|
static inline int vfs_whiteout(struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
|
|
|
return vfs_mknod(dir, dentry, S_IFCHR | WHITEOUT_MODE, WHITEOUT_DEV);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-01-17 11:34:52 +07:00
|
|
|
extern struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode,
|
|
|
|
int open_flag);
|
|
|
|
|
2017-12-02 05:12:45 +07:00
|
|
|
int vfs_mkobj(struct dentry *, umode_t,
|
|
|
|
int (*f)(struct dentry *, umode_t, void *),
|
|
|
|
void *);
|
|
|
|
|
2020-07-14 13:47:43 +07:00
|
|
|
int vfs_fchown(struct file *file, uid_t user, gid_t group);
|
2020-07-14 13:55:05 +07:00
|
|
|
int vfs_fchmod(struct file *file, umode_t mode);
|
2020-07-15 13:19:55 +07:00
|
|
|
int vfs_utimes(const struct path *path, struct timespec64 *times);
|
2020-07-14 13:47:43 +07:00
|
|
|
|
2018-07-18 20:44:40 +07:00
|
|
|
extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
|
|
|
|
2018-09-11 21:55:03 +07:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
|
|
|
|
unsigned long arg);
|
|
|
|
#else
|
|
|
|
#define compat_ptr_ioctl NULL
|
|
|
|
#endif
|
|
|
|
|
2005-11-09 12:35:04 +07:00
|
|
|
/*
|
|
|
|
* VFS file helper functions.
|
|
|
|
*/
|
2010-03-04 21:29:14 +07:00
|
|
|
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
|
2011-07-25 10:20:18 +07:00
|
|
|
umode_t mode);
|
2016-06-10 03:34:02 +07:00
|
|
|
extern bool may_open_dev(const struct path *path);
|
2008-10-09 06:44:18 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* This is the "filldir" function type, used by readdir() to let
|
|
|
|
* the kernel specify what kind of dirent layout it wants to have.
|
|
|
|
* This allows the kernel to read directories into kernel space or
|
|
|
|
* to have different dirent layouts depending on the binary type.
|
|
|
|
*/
|
2014-10-30 23:37:34 +07:00
|
|
|
struct dir_context;
|
|
|
|
typedef int (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
|
|
|
|
unsigned);
|
|
|
|
|
2013-05-16 00:52:59 +07:00
|
|
|
struct dir_context {
|
2018-04-10 03:12:30 +07:00
|
|
|
filldir_t actor;
|
2013-05-16 05:49:12 +07:00
|
|
|
loff_t pos;
|
2013-05-16 00:52:59 +07:00
|
|
|
};
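The actor/context pairing above is usually consumed by embedding the context in a larger caller-owned struct and recovering that struct inside the callback. The following userspace sketch mirrors the shape of filldir_t and struct dir_context with hypothetical toy_* types; the iterate function and the entry names are assumptions for illustration only.

```c
#include <stddef.h>
#include <string.h>

struct toy_dir_context;
typedef int (*toy_filldir_t)(struct toy_dir_context *, const char *, int,
			     long long, unsigned long long, unsigned);

struct toy_dir_context {
	toy_filldir_t actor;
	long long pos;
};

/* Caller-private state wrapped around the generic context. */
struct toy_count_ctx {
	struct toy_dir_context ctx;
	int entries;
};

/* Actor: recover the wrapper from the embedded context and count. */
static int toy_count_actor(struct toy_dir_context *ctx, const char *name,
			   int namelen, long long pos,
			   unsigned long long ino, unsigned type)
{
	struct toy_count_ctx *c = (struct toy_count_ctx *)
		((char *)ctx - offsetof(struct toy_count_ctx, ctx));

	(void)name; (void)namelen; (void)pos; (void)ino; (void)type;
	c->entries++;
	return 0;	/* zero asks the iterator to keep going */
}

/* Hypothetical iterator: feed each name to the actor, advancing pos. */
static int toy_iterate(struct toy_dir_context *ctx,
		       const char *const *names, int count)
{
	while (ctx->pos < count) {
		const char *name = names[ctx->pos];

		if (ctx->actor(ctx, name, (int)strlen(name), ctx->pos,
			       (unsigned long long)ctx->pos + 1, 0) != 0)
			break;
		ctx->pos++;
	}
	return 0;
}
```

Because the position lives in the context, a caller can stop early by returning non-zero from the actor and later resume from the same pos.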
|
2013-05-16 05:49:12 +07:00
|
|
|
|
2015-01-14 16:42:32 +07:00
|
|
|
/*
 * These flags let !MMU mmap() govern direct device mapping vs immediate
 * copying more easily for MAP_PRIVATE, especially for ROM filesystems.
 *
 * NOMMU_MAP_COPY:	Copy can be mapped (MAP_PRIVATE)
 * NOMMU_MAP_DIRECT:	Can be mapped directly (MAP_SHARED)
 * NOMMU_MAP_READ:	Can be mapped for reading
 * NOMMU_MAP_WRITE:	Can be mapped for writing
 * NOMMU_MAP_EXEC:	Can be mapped for execution
 */
#define NOMMU_MAP_COPY		0x00000001
#define NOMMU_MAP_DIRECT	0x00000008
#define NOMMU_MAP_READ		VM_MAYREAD
#define NOMMU_MAP_WRITE		VM_MAYWRITE
#define NOMMU_MAP_EXEC		VM_MAYEXEC

#define NOMMU_VMFLAGS \
	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)
|
|
|
|
|
2018-10-30 06:41:21 +07:00
|
|
|
/*
|
|
|
|
* These flags control the behavior of the remap_file_range function pointer.
|
|
|
|
* If it is called with len == 0 that means "remap to end of source file".
|
2019-06-08 01:54:35 +07:00
|
|
|
* See Documentation/filesystems/vfs.rst for more details about this call.
|
2018-10-30 06:41:21 +07:00
|
|
|
*
|
|
|
|
* REMAP_FILE_DEDUP: only remap if contents identical (i.e. deduplicate)
|
2018-10-30 06:42:10 +07:00
|
|
|
* REMAP_FILE_CAN_SHORTEN: caller can handle a shortened request
|
2018-10-30 06:41:21 +07:00
|
|
|
*/
|
|
|
|
#define REMAP_FILE_DEDUP (1 << 0)
|
2018-10-30 06:42:10 +07:00
|
|
|
#define REMAP_FILE_CAN_SHORTEN (1 << 1)
|
2018-10-30 06:41:21 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* These flags signal that the caller is ok with altering various aspects of
|
|
|
|
* the behavior of the remap operation. The changes must be made by the
|
|
|
|
* implementation; the vfs remap helper functions can take advantage of them.
|
|
|
|
* Flags in this category exist to preserve the quirky behavior of the hoisted
|
|
|
|
* btrfs clone/dedupe ioctls.
|
|
|
|
*/
|
2018-10-30 06:42:10 +07:00
|
|
|
#define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN)
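As an illustration of how these flags might be consumed, the sketch below rejects flag bits an implementation does not understand and only shortens a request when the caller has signalled that it can cope. The flag values are repeated from the definitions above so the block is self-contained; demo_check_remap_flags() and demo_clamp_len() are hypothetical helpers, not kernel functions, and the length rule is a simplified model.

```c
#include <errno.h>

#define REMAP_FILE_DEDUP		(1 << 0)
#define REMAP_FILE_CAN_SHORTEN		(1 << 1)
#define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)

/* Hypothetical helper: refuse any flag this implementation does not know. */
int demo_check_remap_flags(unsigned int remap_flags)
{
	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical helper: if only 'avail' bytes can be remapped, shorten the
 * request when the caller allows it, otherwise fail the whole request.
 */
long long demo_clamp_len(long long len, long long avail,
			 unsigned int remap_flags)
{
	if (len <= avail)
		return len;
	if (remap_flags & REMAP_FILE_CAN_SHORTEN)
		return avail;
	return -EINVAL;
}
```

Keeping the advisory flags in a separate mask means a new advisory bit can be accepted without every implementation having to act on it.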
|
2015-01-14 16:42:32 +07:00
|
|
|
|
2014-02-12 06:37:41 +07:00
|
|
|
struct iov_iter;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct file_operations {
|
|
|
|
struct module *owner;
|
|
|
|
loff_t (*llseek) (struct file *, loff_t, int);
|
|
|
|
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
|
|
|
|
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
|
2014-02-12 06:37:41 +07:00
|
|
|
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
|
|
|
|
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
|
2018-11-22 22:37:38 +07:00
|
|
|
int (*iopoll)(struct kiocb *kiocb, bool spin);
|
2013-05-16 05:49:12 +07:00
|
|
|
int (*iterate) (struct file *, struct dir_context *);
|
2016-04-21 10:08:32 +07:00
|
|
|
int (*iterate_shared) (struct file *, struct dir_context *);
|
2017-07-03 09:22:01 +07:00
|
|
|
__poll_t (*poll) (struct file *, struct poll_table_struct *);
|
2005-04-17 05:20:36 +07:00
|
|
|
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
|
|
|
|
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
|
|
|
|
int (*mmap) (struct file *, struct vm_area_struct *);
|
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
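The validation described above can be modelled roughly as follows. This is a userspace sketch, not the kernel code: the DEMO_ constants are illustrative values chosen for the example (only the 0x3 value for the validate type comes from the text), and demo_file_operations and demo_validate_mmap_flags() are hypothetical names.

```c
#include <errno.h>

/* Illustrative flag values for this sketch only. */
#define DEMO_MAP_SHARED			0x01
#define DEMO_MAP_PRIVATE		0x02
#define DEMO_MAP_SHARED_VALIDATE	0x03
#define DEMO_MAP_TYPE			0x0f
#define DEMO_MAP_FIXED			0x10
#define DEMO_MAP_SYNC			0x80000

/* Flags every legacy mmap implementation is assumed to accept. */
#define DEMO_LEGACY_MAP_MASK	(DEMO_MAP_TYPE | DEMO_MAP_FIXED)

struct demo_file_operations {
	unsigned long mmap_supported_flags;
};

int demo_validate_mmap_flags(const struct demo_file_operations *fop,
			     unsigned long flags)
{
	unsigned long flags_mask = DEMO_LEGACY_MAP_MASK | fop->mmap_supported_flags;

	switch (flags & DEMO_MAP_TYPE) {
	case DEMO_MAP_SHARED:
	case DEMO_MAP_PRIVATE:
		/* Legacy map types: unknown flags are not validated. */
		return 0;
	case DEMO_MAP_SHARED_VALIDATE:
		/* New map type: every flag must be legacy or opted in. */
		if (flags & ~flags_mask)
			return -EOPNOTSUPP;
		return 0;
	default:
		return -EINVAL;
	}
}
```

With this shape, a caller that passes the validate type together with a new flag gets an error from any file whose operations have not opted in, instead of a mapping that silently lacks the requested behaviour.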
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 22:36:30 +07:00
|
|
|
unsigned long mmap_supported_flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*open) (struct inode *, struct file *);
|
2006-06-23 16:05:12 +07:00
|
|
|
int (*flush) (struct file *, fl_owner_t id);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*release) (struct inode *, struct file *);
|
2011-07-17 07:44:56 +07:00
|
|
|
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*fasync) (int, struct file *, int);
|
|
|
|
int (*lock) (struct file *, int, struct file_lock *);
|
|
|
|
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
|
|
|
|
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
|
|
|
|
int (*check_flags)(int);
|
|
|
|
int (*flock) (struct file *, int, struct file_lock *);
|
2006-04-11 19:57:50 +07:00
|
|
|
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
|
|
|
|
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
|
2014-08-22 21:40:25 +07:00
|
|
|
int (*setlease)(struct file *, long, struct file_lock **, void **);
|
2011-01-14 19:07:43 +07:00
|
|
|
long (*fallocate)(struct file *file, int mode, loff_t offset,
|
|
|
|
loff_t len);
|
2014-09-30 06:08:25 +07:00
|
|
|
void (*show_fdinfo)(struct seq_file *m, struct file *f);
|
2015-01-14 16:42:32 +07:00
|
|
|
#ifndef CONFIG_MMU
|
|
|
|
unsigned (*mmap_capabilities)(struct file *);
|
|
|
|
#endif
|
2015-12-03 18:59:50 +07:00
|
|
|
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
|
|
|
|
loff_t, size_t, unsigned int);
|
2018-10-30 06:41:49 +07:00
|
|
|
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
|
|
|
loff_t len, unsigned int remap_flags);
|
2018-08-27 19:56:02 +07:00
|
|
|
int (*fadvise)(struct file *, loff_t, loff_t, int);
|
2016-10-28 15:22:25 +07:00
|
|
|
} __randomize_layout;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct inode_operations {
|
2012-06-11 04:13:09 +07:00
|
|
|
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
|
2015-12-30 03:58:39 +07:00
|
|
|
const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
|
2011-06-21 06:28:19 +07:00
|
|
|
int (*permission) (struct inode *, int);
|
2011-07-23 22:37:31 +07:00
|
|
|
struct posix_acl * (*get_acl)(struct inode *, int);
|
2011-01-07 13:49:56 +07:00
|
|
|
|
|
|
|
int (*readlink) (struct dentry *, char __user *,int);
|
|
|
|
|
2012-06-11 05:05:36 +07:00
|
|
|
int (*create) (struct inode *,struct dentry *, umode_t, bool);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*link) (struct dentry *,struct inode *,struct dentry *);
|
|
|
|
int (*unlink) (struct inode *,struct dentry *);
|
|
|
|
int (*symlink) (struct inode *,struct dentry *,const char *);
|
2011-07-26 12:41:39 +07:00
|
|
|
int (*mkdir) (struct inode *,struct dentry *,umode_t);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*rmdir) (struct inode *,struct dentry *);
|
2011-07-26 12:52:52 +07:00
|
|
|
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*rename) (struct inode *, struct dentry *,
|
2014-04-01 22:08:42 +07:00
|
|
|
struct inode *, struct dentry *, unsigned int);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*setattr) (struct dentry *, struct iattr *);
|
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};
The defined bits in request_mask and stx_mask are:
	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
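The rule that a field is only trustworthy when its bit is set in stx_mask can be sketched as below. This is a self-contained model rather than a call into the real system call: the demo_ structures keep only the fields needed for the example, and the DEMO_STATX_ constants are illustrative stand-ins that a real program would take from the uapi headers instead.

```c
#include <stdint.h>

/* Illustrative stand-ins for the mask bits (assumptions for this sketch). */
#define DEMO_STATX_UID		0x00000008U
#define DEMO_STATX_BASIC_STATS	0x000007ffU
#define DEMO_STATX_BTIME	0x00000800U

struct demo_statx_timestamp {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

/* Cut-down model of the result record: mask plus two sample fields. */
struct demo_statx {
	uint32_t stx_mask;
	uint32_t stx_uid;
	struct demo_statx_timestamp stx_btime;
};

/* Store the creation time only if the filesystem actually reported it. */
int demo_get_btime(const struct demo_statx *stx, int64_t *sec_out)
{
	if (!(stx->stx_mask & DEMO_STATX_BTIME))
		return 0;	/* field absent; leave *sec_out untouched */
	*sec_out = stx->stx_btime.tv_sec;
	return 1;
}

/* A UID without its mask bit may be a fabricated value. */
int demo_uid_is_real(const struct demo_statx *stx)
{
	return (stx->stx_mask & DEMO_STATX_UID) != 0;
}
```

Checking the returned mask rather than the requested one matters because a filesystem may decline to provide a value the caller asked for.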
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 23:46:22 +07:00
|
|
|
int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
|
2005-04-17 05:20:36 +07:00
|
|
|
ssize_t (*listxattr) (struct dentry *, char *, size_t);
|
2008-10-09 06:44:18 +07:00
|
|
|
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
|
|
|
|
u64 len);
|
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
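To make the y2038 limitation concrete, the sketch below models a seconds field that is only 32 bits wide, which is an assumption matching 32-bit architectures, next to a 64-bit one. The demo_ types and helpers are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Assumption: tv_sec is 32 bits wide, as on 32-bit architectures. */
struct demo_timespec32 {
	int32_t tv_sec;
	int32_t tv_nsec;
};

struct demo_timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* True if the 64-bit seconds value is representable in 32 bits. */
int demo_fits_timespec32(struct demo_timespec64 ts)
{
	return ts.tv_sec >= INT32_MIN && ts.tv_sec <= INT32_MAX;
}

/* Narrow only when representable; returns 0 on success, -1 otherwise. */
int demo_narrow(struct demo_timespec64 in, struct demo_timespec32 *out)
{
	if (!demo_fits_timespec32(in))
		return -1;
	out->tv_sec = (int32_t)in.tv_sec;
	out->tv_nsec = (int32_t)in.tv_nsec;
	return 0;
}
```

The largest representable value, 2147483647 seconds after the epoch, falls on 19 January 2038; one second later no longer fits in the 32-bit field.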
The change was made with the help of the following coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script could be made a little shorter by combining different cases,
but this version was sufficient for my use case.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime1, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime1, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 09:36:02 +07:00
|
|
|
int (*update_time)(struct inode *, struct timespec64 *, int);
|
2012-06-22 15:39:14 +07:00
|
|
|
int (*atomic_open)(struct inode *, struct dentry *,
|
2012-06-22 15:40:19 +07:00
|
|
|
struct file *, unsigned open_flag,
|
2018-06-09 00:32:02 +07:00
|
|
|
umode_t create_mode);
|
2013-06-07 12:20:27 +07:00
|
|
|
int (*tmpfile) (struct inode *, struct dentry *, umode_t);
|
2013-12-20 20:16:39 +07:00
|
|
|
int (*set_acl)(struct inode *, struct posix_acl *, int);
|
2011-01-07 13:49:56 +07:00
|
|
|
} ____cacheline_aligned;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-02-20 22:51:23 +07:00
|
|
|
static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
|
|
|
|
struct iov_iter *iter)
|
|
|
|
{
|
|
|
|
return file->f_op->read_iter(kio, iter);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
|
|
|
|
struct iov_iter *iter)
|
|
|
|
{
|
|
|
|
return file->f_op->write_iter(kio, iter);
|
|
|
|
}
|
|
|
|
|
2017-02-20 22:51:23 +07:00
|
|
|
static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return file->f_op->mmap(file, vma);
|
|
|
|
}
|
|
|
|
|
2006-10-01 13:28:49 +07:00
|
|
|
ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
|
2011-11-01 07:06:39 +07:00
|
|
|
unsigned long nr_segs, unsigned long fast_segs,
|
|
|
|
struct iovec *fast_pointer,
|
2012-06-01 06:26:42 +07:00
|
|
|
struct iovec **ret_pointer);
|
2006-10-01 13:28:49 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
|
|
|
|
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
|
|
|
|
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
|
2017-07-06 23:58:37 +07:00
|
|
|
unsigned long, loff_t *, rwf_t);
|
2015-11-11 04:53:30 +07:00
|
|
|
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
|
|
|
|
loff_t, size_t, unsigned int);
|
2019-06-05 22:04:47 +07:00
|
|
|
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
|
|
|
size_t len, unsigned int flags);
|
2018-10-30 06:41:08 +07:00
|
|
|
extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
2018-10-30 06:41:49 +07:00
|
|
|
loff_t *count,
|
|
|
|
unsigned int remap_flags);
|
|
|
|
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
2018-10-30 06:41:56 +07:00
|
|
|
loff_t len, unsigned int remap_flags);
|
2018-10-30 06:41:49 +07:00
|
|
|
extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
2018-10-30 06:41:56 +07:00
|
|
|
loff_t len, unsigned int remap_flags);
|
2015-12-19 15:55:59 +07:00
|
|
|
extern int vfs_dedupe_file_range(struct file *file,
|
|
|
|
struct file_dedupe_range *same);
|
2018-10-30 06:41:49 +07:00
|
|
|
extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
|
|
|
|
struct file *dst_file, loff_t dst_pos,
|
2018-10-30 06:42:03 +07:00
|
|
|
loff_t len, unsigned int remap_flags);
|
2018-07-18 20:44:40 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct super_operations {
|
|
|
|
struct inode *(*alloc_inode)(struct super_block *sb);
|
|
|
|
void (*destroy_inode)(struct inode *);
|
2019-04-11 01:43:44 +07:00
|
|
|
void (*free_inode)(struct inode *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-05-27 17:53:02 +07:00
|
|
|
void (*dirty_inode) (struct inode *, int flags);
|
2010-03-05 15:21:37 +07:00
|
|
|
int (*write_inode) (struct inode *, struct writeback_control *wbc);
|
2010-06-08 00:43:19 +07:00
|
|
|
int (*drop_inode) (struct inode *);
|
2010-06-05 06:40:39 +07:00
|
|
|
void (*evict_inode) (struct inode *);
|
2005-04-17 05:20:36 +07:00
|
|
|
void (*put_super) (struct super_block *);
|
|
|
|
int (*sync_fs)(struct super_block *sb, int wait);
|
fs: add freeze_super/thaw_super fs hooks
Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.
In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.
To solve this problem, this patch adds the freeze_super and thaw_super
hooks. If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
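The dispatch described above can be sketched in userspace as follows. All demo_ names are hypothetical: demo_vfs_freeze_super() stands in for the generic freeze code, the cluster_lock_held flag stands in for a cluster-wide lock such as the glock mentioned above, and the -EBUSY return is a choice made for this model.

```c
#include <errno.h>
#include <stddef.h>

struct demo_super_block;

struct demo_super_operations {
	int (*freeze_super)(struct demo_super_block *);
};

struct demo_super_block {
	const struct demo_super_operations *s_op;
	int frozen;
	int cluster_lock_held;
};

/* Stand-in for the generic VFS freeze path. */
int demo_vfs_freeze_super(struct demo_super_block *sb)
{
	sb->frozen = 1;
	return 0;
}

/* Dispatcher: call the filesystem hook if present, else the generic code. */
int demo_freeze(struct demo_super_block *sb)
{
	if (sb->s_op && sb->s_op->freeze_super)
		return sb->s_op->freeze_super(sb);
	return demo_vfs_freeze_super(sb);
}

/*
 * Hypothetical cluster filesystem hook: take the cluster lock first, then
 * call the generic freeze code itself, as the hook contract requires.
 */
int demo_cluster_freeze_super(struct demo_super_block *sb)
{
	if (sb->cluster_lock_held)
		return -EBUSY;	/* a freeze is already in progress */
	sb->cluster_lock_held = 1;
	return demo_vfs_freeze_super(sb);
}
```

The point of the ordering is that the cluster lock is acquired before the generic code takes its own locks, so a second requester fails early instead of blocking inside the generic freeze path.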
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-14 09:42:03 +07:00
|
|
|
int (*freeze_super) (struct super_block *);
|
2009-01-10 07:40:58 +07:00
|
|
|
int (*freeze_fs) (struct super_block *);
|
2014-11-14 09:42:03 +07:00
|
|
|
int (*thaw_super) (struct super_block *);
|
2009-01-10 07:40:58 +07:00
|
|
|
int (*unfreeze_fs) (struct super_block *);
|
2006-06-23 16:02:58 +07:00
|
|
|
int (*statfs) (struct dentry *, struct kstatfs *);
|
2005-04-17 05:20:36 +07:00
|
|
|
int (*remount_fs) (struct super_block *, int *, char *);
|
2008-04-24 18:21:56 +07:00
|
|
|
void (*umount_begin) (struct super_block *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-12-09 09:32:45 +07:00
|
|
|
int (*show_options)(struct seq_file *, struct dentry *);
|
2011-12-09 09:32:45 +07:00
|
|
|
int (*show_devname)(struct seq_file *, struct dentry *);
|
2011-12-09 09:37:57 +07:00
|
|
|
int (*show_path)(struct seq_file *, struct dentry *);
|
2011-12-09 08:51:13 +07:00
|
|
|
int (*show_stats)(struct seq_file *, struct dentry *);
|
2006-09-29 15:59:56 +07:00
|
|
|
#ifdef CONFIG_QUOTA
|
2005-04-17 05:20:36 +07:00
|
|
|
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
|
|
|
|
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
|
2014-09-25 21:36:14 +07:00
|
|
|
struct dquot **(*get_dquots)(struct inode *);
|
2006-09-29 15:59:56 +07:00
|
|
|
#endif
|
2009-01-03 21:47:09 +07:00
|
|
|
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
|
2015-02-13 05:58:51 +07:00
|
|
|
long (*nr_cached_objects)(struct super_block *,
|
|
|
|
struct shrink_control *);
|
|
|
|
long (*free_cached_objects)(struct super_block *,
|
|
|
|
struct shrink_control *);
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2012-10-15 22:40:35 +07:00
|
|
|
/*
|
|
|
|
* Inode flags - they have no relation to superblock flags now
|
|
|
|
*/
|
2020-07-13 10:09:52 +07:00
|
|
|
#define S_SYNC (1 << 0) /* Writes are synced at once */
|
|
|
|
#define S_NOATIME (1 << 1) /* Do not update access times */
|
|
|
|
#define S_APPEND (1 << 2) /* Append-only file */
|
|
|
|
#define S_IMMUTABLE (1 << 3) /* Immutable file */
|
|
|
|
#define S_DEAD (1 << 4) /* removed, but still open directory */
|
|
|
|
#define S_NOQUOTA (1 << 5) /* Inode is not counted to quota */
|
|
|
|
#define S_DIRSYNC (1 << 6) /* Directory modifications are synchronous */
|
|
|
|
#define S_NOCMTIME (1 << 7) /* Do not update file c/mtime */
|
|
|
|
#define S_SWAPFILE (1 << 8) /* Do not truncate: swapon got its bmaps */
|
|
|
|
#define S_PRIVATE (1 << 9) /* Inode is fs-internal */
|
|
|
|
#define S_IMA (1 << 10) /* Inode has an associated IMA struct */
|
|
|
|
#define S_AUTOMOUNT (1 << 11) /* Automount/referral quasi-directory */
|
|
|
|
#define S_NOSEC (1 << 12) /* no suid or xattr security attributes */
|
2015-02-17 06:59:25 +07:00
|
|
|
#ifdef CONFIG_FS_DAX
|
2020-07-13 10:09:52 +07:00
|
|
|
#define S_DAX (1 << 13) /* Direct Access, avoiding the page cache */
|
2015-02-17 06:58:53 +07:00
|
|
|
#else
|
2020-07-13 10:09:52 +07:00
|
|
|
#define S_DAX 0 /* Make all the DAX code disappear */
|
2015-02-17 06:58:53 +07:00
|
|
|
#endif
|
2020-07-13 10:09:52 +07:00
|
|
|
#define S_ENCRYPTED (1 << 14) /* Encrypted file (using fs/crypto/) */
|
|
|
|
#define S_CASEFOLD (1 << 15) /* Casefolded file */
|
|
|
|
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
|
2012-10-15 22:40:35 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note that nosuid etc flags are inode-specific: setting some file-system
|
|
|
|
* flags just means all the inodes inherit those flags by default. It might be
|
|
|
|
* possible to override it selectively if you really wanted to with some
|
|
|
|
* ioctl() that is not currently implemented.
|
|
|
|
*
|
2017-07-17 14:45:35 +07:00
|
|
|
* Exception: SB_RDONLY is always applied to the entire file system.
|
2012-10-15 22:40:35 +07:00
|
|
|
*
|
|
|
|
* Unfortunately, it is possible to change a filesystem's flags with it mounted
|
|
|
|
* with files in use. This means that all of the inodes will not have their
|
|
|
|
* i_flags updated. Hence, i_flags no longer inherit the superblock mount
|
|
|
|
* flags, so these have to be checked separately. -- rmk@arm.uk.linux.org
|
|
|
|
*/
|
|
|
|
#define __IS_FLG(inode, flg) ((inode)->i_sb->s_flags & (flg))
|
|
|
|
|
2017-11-28 04:05:09 +07:00
|
|
|
static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags & SB_RDONLY; }
|
2017-07-17 14:45:34 +07:00
|
|
|
#define IS_RDONLY(inode) sb_rdonly((inode)->i_sb)
|
2017-07-17 14:45:35 +07:00
|
|
|
#define IS_SYNC(inode) (__IS_FLG(inode, SB_SYNCHRONOUS) || \
|
2012-10-15 22:40:35 +07:00
|
|
|
((inode)->i_flags & S_SYNC))
|
2017-07-17 14:45:35 +07:00
|
|
|
#define IS_DIRSYNC(inode) (__IS_FLG(inode, SB_SYNCHRONOUS|SB_DIRSYNC) || \
|
2012-10-15 22:40:35 +07:00
|
|
|
((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
|
2017-07-17 14:45:35 +07:00
|
|
|
#define IS_MANDLOCK(inode) __IS_FLG(inode, SB_MANDLOCK)
|
|
|
|
#define IS_NOATIME(inode) __IS_FLG(inode, SB_RDONLY|SB_NOATIME)
|
|
|
|
#define IS_I_VERSION(inode) __IS_FLG(inode, SB_I_VERSION)
|
2012-10-15 22:40:35 +07:00
|
|
|
|
|
|
|
#define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA)
|
|
|
|
#define IS_APPEND(inode) ((inode)->i_flags & S_APPEND)
|
|
|
|
#define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE)
|
2017-07-17 14:45:35 +07:00
|
|
|
#define IS_POSIXACL(inode) __IS_FLG(inode, SB_POSIXACL)
|
2012-10-15 22:40:35 +07:00
|
|
|
|
|
|
|
#define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD)
|
|
|
|
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
|
|
|
|
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
|
|
|
|
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
|
|
|
|
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
|
|
|
|
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
|
|
|
|
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
|
2015-02-17 06:58:53 +07:00
|
|
|
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
|
2017-10-10 02:15:35 +07:00
|
|
|
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
|
ext4: Support case-insensitive file name lookups
This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.
A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.
The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.
* dcache handling:
For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache regardless of which equivalent string was used in
a previous lookup, without having to resort to ->lookup().
d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.
For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.
* on-disk data:
Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.
DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.
* Dealing with invalid sequences:
By default, when an invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.
* Normalization algorithm:
The UTF-8 algorithm used to compare strings in ext4 lives in
fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.
NFD seems to be the best normalization method for EXT4 because:
- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.
Although:
- This implementation is not completely linguistically accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-26 01:12:08 +07:00
|
|
|
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
|
2019-07-22 23:26:21 +07:00
|
|
|
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
|
2012-10-15 22:40:35 +07:00
|
|
|
|
2014-10-24 05:14:36 +07:00
|
|
|
#define IS_WHITEOUT(inode) (S_ISCHR((inode)->i_mode) && \
|
|
|
|
(inode)->i_rdev == WHITEOUT_DEV)
|
|
|
|
|
2016-06-30 02:54:46 +07:00
|
|
|
static inline bool HAS_UNMAPPED_ID(struct inode *inode)
|
|
|
|
{
|
|
|
|
return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
|
|
|
|
}
|
|
|
|
|
fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode have a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO; for buffered writeback, only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
 * writehint.c: get or set an inode write hint
 */
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>

#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif

/* Hint names, indexed by the numeric hint value. */
static char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	/* Optional second argument: set the hint before reading it back. */
	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	/* Guard the table lookup against hint values without a name. */
	if (hint < sizeof(str) / sizeof(str[0]))
		printf("%s: hint %s\n", argv[1], str[hint]);
	else
		printf("%s: hint %" PRIu64 " (unknown)\n", argv[1], hint);
	close(fd);
	return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 00:47:04 +07:00
|
|
|
static inline enum rw_hint file_write_hint(struct file *file)
|
|
|
|
{
|
|
|
|
if (file->f_write_hint != WRITE_LIFE_NOT_SET)
|
|
|
|
return file->f_write_hint;
|
|
|
|
|
|
|
|
return file_inode(file)->i_write_hint;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int iocb_flags(struct file *file);
|
|
|
|
|
2018-05-23 00:52:18 +07:00
|
|
|
static inline u16 ki_hint_validate(enum rw_hint hint)
|
|
|
|
{
|
|
|
|
typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
|
|
|
|
|
|
|
|
if (hint <= max_hint)
|
|
|
|
return hint;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-06-28 00:47:04 +07:00
|
|
|
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
|
|
|
|
{
|
|
|
|
*kiocb = (struct kiocb) {
|
|
|
|
.ki_filp = filp,
|
|
|
|
.ki_flags = iocb_flags(filp),
|
2018-05-23 00:52:18 +07:00
|
|
|
.ki_hint = ki_hint_validate(file_write_hint(filp)),
|
2018-11-20 08:52:38 +07:00
|
|
|
.ki_ioprio = get_current_ioprio(),
|
2017-06-28 00:47:04 +07:00
|
|
|
};
|
|
|
|
}
|
|
|
|
|
2019-11-20 16:45:25 +07:00
|
|
|
static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
|
|
|
|
struct file *filp)
|
|
|
|
{
|
|
|
|
*kiocb = (struct kiocb) {
|
|
|
|
.ki_filp = filp,
|
|
|
|
.ki_flags = kiocb_src->ki_flags,
|
|
|
|
.ki_hint = kiocb_src->ki_hint,
|
|
|
|
.ki_ioprio = kiocb_src->ki_ioprio,
|
|
|
|
.ki_pos = kiocb_src->ki_pos,
|
|
|
|
};
|
|
|
|
}
|
|
|
|
|
2007-10-17 13:30:44 +07:00
|
|
|
/*
|
2011-03-22 18:23:36 +07:00
|
|
|
* Inode state bits. Protected by inode->i_lock
|
2007-10-17 13:30:44 +07:00
|
|
|
*
|
|
|
|
* Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
|
|
|
|
* I_DIRTY_DATASYNC and I_DIRTY_PAGES.
|
|
|
|
*
|
|
|
|
* Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
|
|
|
|
* until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
|
|
|
|
* various stages of removing an inode.
|
|
|
|
*
|
2009-12-17 20:25:01 +07:00
|
|
|
* Two bits are used for locking and completion notification, I_NEW and I_SYNC.
|
2007-10-17 13:30:44 +07:00
|
|
|
*
|
2008-02-06 16:36:59 +07:00
|
|
|
* I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
|
|
|
|
* fdatasync(). i_atime is the usual cause.
|
2008-02-15 10:31:32 +07:00
|
|
|
* I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
|
|
|
|
* these changes separately from I_DIRTY_SYNC so that we
|
|
|
|
* don't have to write inode on fdatasync() when only
|
|
|
|
* mtime has changed in it.
|
2007-10-17 13:30:44 +07:00
|
|
|
* I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
|
2009-12-17 20:25:01 +07:00
|
|
|
* I_NEW Serves as both a mutex and completion notification.
|
|
|
|
* New inodes set I_NEW. If two processes both create
|
|
|
|
* the same inode, one of them will release its inode and
|
|
|
|
* wait for I_NEW to be released before returning.
|
|
|
|
* Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
|
|
|
|
* also cause waiting on I_NEW, without I_NEW actually
|
|
|
|
* being set. find_inode() uses this to prevent returning
|
|
|
|
* nearly-dead inodes.
|
2007-10-17 13:30:44 +07:00
|
|
|
* I_WILL_FREE Must be set when calling write_inode_now() if i_count
|
|
|
|
* is zero. I_FREEING must be set when I_WILL_FREE is
|
|
|
|
* cleared.
|
|
|
|
* I_FREEING Set when inode is about to be freed but still has dirty
|
|
|
|
* pages or buffers attached or the inode itself is still
|
|
|
|
* dirty.
|
2012-05-03 19:48:02 +07:00
|
|
|
* I_CLEAR Added by clear_inode(). In this state the inode is
|
|
|
|
* clean and can be destroyed. Inode keeps I_FREEING.
|
2007-10-17 13:30:44 +07:00
|
|
|
*
|
|
|
|
* Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
|
|
|
|
* prohibited for many purposes. iget() must wait for
|
|
|
|
* the inode to be completely released, then create it
|
|
|
|
* anew. Other functions will just ignore such inodes,
|
2009-12-17 20:25:01 +07:00
|
|
|
* if appropriate. I_NEW is used for waiting.
|
2007-10-17 13:30:44 +07:00
|
|
|
*
|
2012-05-03 19:48:03 +07:00
|
|
|
* I_SYNC Writeback of inode is running. The bit is set during
|
|
|
|
* data writeback, and cleared with a wakeup on the bit
|
|
|
|
* address once it is done. The bit is also used to pin
|
|
|
|
* the inode in memory for flusher thread.
|
2007-10-17 13:30:44 +07:00
|
|
|
*
|
2011-06-25 01:29:43 +07:00
|
|
|
* I_REFERENCED Marks the inode as recently referenced on the LRU list.
|
|
|
|
*
|
|
|
|
* I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit().
|
|
|
|
*
|
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch builds the
framework for the actual switching.
This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a given inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats has to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat update is wrapped in a transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29 01:50:53 +07:00
|
|
|
* I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
|
|
|
|
* synchronize competing switching instances and to tell
|
2018-04-11 06:36:56 +07:00
|
|
|
* wb stat updates to grab the i_pages lock. See
|
2019-03-06 06:50:03 +07:00
|
|
|
* inode_switch_wbs_work_fn() for details.
|
2015-05-29 01:50:53 +07:00
|
|
|
*
|
2017-06-21 19:28:32 +07:00
|
|
|
* I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
|
|
|
|
* and work dirs among overlayfs mounts.
|
|
|
|
*
|
2018-06-29 02:53:17 +07:00
|
|
|
* I_CREATING New object's inode in the middle of setting up.
|
|
|
|
*
|
2020-04-30 21:41:37 +07:00
|
|
|
* I_DONTCACHE Evict inode as soon as it is not used anymore.
|
|
|
|
*
|
2007-10-17 13:30:44 +07:00
|
|
|
* Q: What is the difference between I_WILL_FREE and I_FREEING?
|
|
|
|
*/
|
2010-10-23 17:55:17 +07:00
|
|
|
#define I_DIRTY_SYNC (1 << 0)
|
|
|
|
#define I_DIRTY_DATASYNC (1 << 1)
|
|
|
|
#define I_DIRTY_PAGES (1 << 2)
|
2009-12-17 20:25:01 +07:00
|
|
|
#define __I_NEW 3
|
|
|
|
#define I_NEW (1 << __I_NEW)
|
2010-10-23 17:55:17 +07:00
|
|
|
#define I_WILL_FREE (1 << 4)
|
|
|
|
#define I_FREEING (1 << 5)
|
|
|
|
#define I_CLEAR (1 << 6)
|
2009-12-17 20:25:01 +07:00
|
|
|
#define __I_SYNC 7
|
2007-10-17 13:30:44 +07:00
|
|
|
#define I_SYNC (1 << __I_SYNC)
|
2010-10-23 17:55:17 +07:00
|
|
|
#define I_REFERENCED (1 << 8)
|
2011-06-25 01:29:43 +07:00
|
|
|
#define __I_DIO_WAKEUP 9
|
2015-04-17 03:04:56 +07:00
|
|
|
#define I_DIO_WAKEUP (1 << __I_DIO_WAKEUP)
|
2013-06-11 11:34:36 +07:00
|
|
|
#define I_LINKABLE (1 << 10)
|
2015-02-02 12:37:00 +07:00
|
|
|
#define I_DIRTY_TIME (1 << 11)
|
|
|
|
#define __I_DIRTY_TIME_EXPIRED 12
|
|
|
|
#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
|
2015-05-29 01:50:53 +07:00
|
|
|
#define I_WB_SWITCH (1 << 13)
|
2018-06-29 02:53:17 +07:00
|
|
|
#define I_OVL_INUSE (1 << 14)
|
|
|
|
#define I_CREATING (1 << 15)
|
2020-04-30 21:41:37 +07:00
|
|
|
#define I_DONTCACHE (1 << 16)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2018-02-21 22:54:49 +07:00
|
|
|
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
|
|
|
|
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
|
2015-02-02 12:37:00 +07:00
|
|
|
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
extern void __mark_inode_dirty(struct inode *, int);
|
|
|
|
static inline void mark_inode_dirty(struct inode *inode)
|
|
|
|
{
|
|
|
|
__mark_inode_dirty(inode, I_DIRTY);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mark_inode_dirty_sync(struct inode *inode)
|
|
|
|
{
|
|
|
|
__mark_inode_dirty(inode, I_DIRTY_SYNC);
|
|
|
|
}
|
|
|
|
|
2011-11-21 18:11:32 +07:00
|
|
|
extern void inc_nlink(struct inode *inode);
|
|
|
|
extern void drop_nlink(struct inode *inode);
|
|
|
|
extern void clear_nlink(struct inode *inode);
|
|
|
|
extern void set_nlink(struct inode *inode, unsigned int nlink);
|
2006-10-01 13:29:04 +07:00
|
|
|
|
|
|
|
static inline void inode_inc_link_count(struct inode *inode)
|
|
|
|
{
|
|
|
|
inc_nlink(inode);
|
2006-03-23 18:00:51 +07:00
|
|
|
mark_inode_dirty(inode);
|
|
|
|
}
|
|
|
|
|
2006-10-01 13:29:03 +07:00
|
|
|
static inline void inode_dec_link_count(struct inode *inode)
|
|
|
|
{
|
|
|
|
drop_nlink(inode);
|
2006-03-23 18:00:51 +07:00
|
|
|
mark_inode_dirty(inode);
|
|
|
|
}
|
|
|
|
|
2012-03-26 20:59:21 +07:00
|
|
|
enum file_time_flags {
|
|
|
|
S_ATIME = 1,
|
|
|
|
S_MTIME = 2,
|
|
|
|
S_CTIME = 4,
|
|
|
|
S_VERSION = 8,
|
|
|
|
};
|
|
|
|
|
2018-07-18 20:44:43 +07:00
|
|
|
extern bool atime_needs_update(const struct path *, struct inode *);
|
2013-07-16 21:15:46 +07:00
|
|
|
extern void touch_atime(const struct path *);
|
2005-04-17 05:20:36 +07:00
|
|
|
static inline void file_accessed(struct file *file)
|
|
|
|
{
|
|
|
|
if (!(file->f_flags & O_NOATIME))
|
2012-03-15 19:21:57 +07:00
|
|
|
touch_atime(&file->f_path);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2019-06-05 22:04:49 +07:00
|
|
|
extern int file_modified(struct file *file);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
int sync_inode(struct inode *inode, struct writeback_control *wbc);
|
2010-10-06 15:48:20 +07:00
|
|
|
int sync_inode_metadata(struct inode *inode, int wait);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct file_system_type {
|
|
|
|
const char *name;
|
|
|
|
int fs_flags;
|
2012-10-15 22:40:35 +07:00
|
|
|
#define FS_REQUIRES_DEV 1
|
|
|
|
#define FS_BINARY_MOUNTDATA 2
|
|
|
|
#define FS_HAS_SUBTYPE 4
|
2012-07-27 11:42:03 +07:00
|
|
|
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
|
2019-05-15 21:28:34 +07:00
|
|
|
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
|
2012-10-15 22:40:35 +07:00
|
|
|
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
|
2018-12-24 06:55:56 +07:00
|
|
|
int (*init_fs_context)(struct fs_context *);
|
2019-09-07 18:23:15 +07:00
|
|
|
const struct fs_parameter_spec *parameters;
|
2010-07-25 03:17:56 +07:00
|
|
|
struct dentry *(*mount) (struct file_system_type *, int,
|
|
|
|
const char *, void *);
|
2005-04-17 05:20:36 +07:00
|
|
|
void (*kill_sb) (struct super_block *);
|
|
|
|
struct module *owner;
|
|
|
|
struct file_system_type * next;
|
2011-12-13 10:53:00 +07:00
|
|
|
struct hlist_head fs_supers;
|
2007-10-15 19:51:31 +07:00
|
|
|
|
2006-07-03 14:25:27 +07:00
|
|
|
struct lock_class_key s_lock_key;
|
2006-07-03 14:25:28 +07:00
|
|
|
struct lock_class_key s_umount_key;
|
2010-04-28 04:23:57 +07:00
|
|
|
struct lock_class_key s_vfs_rename_key;
|
2012-06-12 21:20:34 +07:00
|
|
|
struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];
|
2007-10-15 19:51:31 +07:00
|
|
|
|
|
|
|
struct lock_class_key i_lock_key;
|
|
|
|
struct lock_class_key i_mutex_key;
|
2007-10-14 06:38:33 +07:00
|
|
|
struct lock_class_key i_mutex_dir_key;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2013-03-03 10:39:14 +07:00
|
|
|
#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
|
|
|
|
|
2010-07-25 03:46:55 +07:00
|
|
|
extern struct dentry *mount_bdev(struct file_system_type *fs_type,
|
|
|
|
int flags, const char *dev_name, void *data,
|
|
|
|
int (*fill_super)(struct super_block *, void *, int));
|
2010-07-25 04:48:30 +07:00
|
|
|
extern struct dentry *mount_single(struct file_system_type *fs_type,
|
|
|
|
int flags, void *data,
|
|
|
|
int (*fill_super)(struct super_block *, void *, int));
|
2010-07-25 14:46:36 +07:00
|
|
|
extern struct dentry *mount_nodev(struct file_system_type *fs_type,
|
|
|
|
int flags, void *data,
|
|
|
|
int (*fill_super)(struct super_block *, void *, int));
|
2011-11-17 09:43:59 +07:00
|
|
|
extern struct dentry *mount_subtree(struct vfsmount *mnt, const char *path);
|
2005-04-17 05:20:36 +07:00
|
|
|
void generic_shutdown_super(struct super_block *sb);
|
|
|
|
void kill_block_super(struct super_block *sb);
|
|
|
|
void kill_anon_super(struct super_block *sb);
|
|
|
|
void kill_litter_super(struct super_block *sb);
|
|
|
|
void deactivate_super(struct super_block *sb);
|
2009-05-06 12:07:50 +07:00
|
|
|
void deactivate_locked_super(struct super_block *sb);
|
2005-04-17 05:20:36 +07:00
|
|
|
int set_anon_super(struct super_block *s, void *data);
|
2018-12-24 05:25:47 +07:00
|
|
|
int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
|
2011-07-08 02:44:25 +07:00
|
|
|
int get_anon_bdev(dev_t *);
|
|
|
|
void free_anon_bdev(dev_t);
|
2018-12-24 05:25:47 +07:00
|
|
|
struct super_block *sget_fc(struct fs_context *fc,
|
|
|
|
int (*test)(struct super_block *, struct fs_context *),
|
|
|
|
int (*set)(struct super_block *, struct fs_context *));
|
2005-04-17 05:20:36 +07:00
|
|
|
struct super_block *sget(struct file_system_type *type,
|
|
|
|
int (*test)(struct super_block *,void *),
|
|
|
|
int (*set)(struct super_block *,void *),
|
2012-06-25 18:55:37 +07:00
|
|
|
int flags, void *data);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* Alas, no aliases. Too much hassle with bringing module.h everywhere */
|
|
|
|
#define fops_get(fops) \
|
|
|
|
(((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
|
|
|
|
#define fops_put(fops) \
|
|
|
|
do { if (fops) module_put((fops)->owner); } while(0)
|
2013-09-23 01:17:15 +07:00
|
|
|
/*
|
|
|
|
* This one is to be used *ONLY* from ->open() instances.
|
|
|
|
* fops must be non-NULL, pinned down *and* module dependencies
|
|
|
|
* should be sufficient to pin the caller down as well.
|
|
|
|
*/
|
|
|
|
#define replace_fops(f, fops) \
|
|
|
|
do { \
|
|
|
|
struct file *__file = (f); \
|
|
|
|
fops_put(__file->f_op); \
|
|
|
|
BUG_ON(!(__file->f_op = (fops))); \
|
|
|
|
} while(0)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
extern int register_filesystem(struct file_system_type *);
|
|
|
|
extern int unregister_filesystem(struct file_system_type *);
|
2018-11-02 06:07:26 +07:00
|
|
|
extern struct vfsmount *kern_mount(struct file_system_type *);
|
2011-07-19 23:32:38 +07:00
|
|
|
extern void kern_unmount(struct vfsmount *mnt);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int may_umount_tree(struct vfsmount *);
|
|
|
|
extern int may_umount(struct vfsmount *);
|
2014-09-14 20:15:10 +07:00
|
|
|
extern long do_mount(const char *, const char __user *,
|
|
|
|
const char *, unsigned long, void *);
|
2016-11-21 07:45:28 +07:00
|
|
|
extern struct vfsmount *collect_mounts(const struct path *);
|
2007-06-07 23:20:32 +07:00
|
|
|
extern void drop_collected_mounts(struct vfsmount *);
|
2010-01-31 10:51:25 +07:00
|
|
|
extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
|
|
|
|
struct vfsmount *);
|
2016-11-21 08:27:12 +07:00
|
|
|
extern int vfs_statfs(const struct path *, struct kstatfs *);
|
2011-03-12 22:41:39 +07:00
|
|
|
extern int user_statfs(const char __user *, struct kstatfs *);
|
|
|
|
extern int fd_statfs(int, struct kstatfs *);
|
2010-03-23 21:34:56 +07:00
|
|
|
extern int freeze_super(struct super_block *super);
|
|
|
|
extern int thaw_super(struct super_block *super);
|
fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API
__d_path() API is asking for trouble and, in the case of apparmor's
d_namespace_path(), getting just that. The root cause is that when __d_path() misses the root
it had been told to look for, it stores the location of the most remote ancestor
in *root. Without grabbing references. Sure, at the moment of call it had
been pinned down by what we have in *path. And if we raced with umount -l, we
could have very well stopped at vfsmount/dentry that got freed as soon as
prepend_path() dropped vfsmount_lock.
It is safe to compare these pointers with pre-existing (and known to be still
alive) vfsmount and dentry, as long as all we are asking is "is it the same
address?". Dereferencing is not safe and apparmor ended up stepping into
that. d_namespace_path() really wants to examine the place where we stopped,
even if it's not connected to our namespace. As the result, it looked
at ->d_sb->s_magic of a dentry that might've been already freed by that point.
All other callers had been careful enough to avoid that, but it's really
a bad interface - it invites that kind of trouble.
The fix is fairly straightforward, even though it's bigger than I'd like:
* prepend_path() root argument becomes const.
* __d_path() is never called with NULL/NULL root. It was a kludge
to start with. Instead, we have an explicit function - d_absolute_root().
Same as __d_path(), except that it doesn't get root passed and stops where
it stops. apparmor and tomoyo are using it.
* __d_path() returns NULL on path outside of root. The main
caller is show_mountinfo() and that's precisely what we pass root for - to
skip those outside chroot jail. Those who don't want that can (and do)
use d_path().
* __d_path() root argument becomes const. Everyone agrees, I hope.
* apparmor does *NOT* try to use __d_path() or any of its variants
when it sees that path->mnt is an internal vfsmount. In that case it's
definitely not mounted anywhere and dentry_path() is exactly what we want
there. Handling of sysctl()-triggered weirdness is moved to that place.
* if apparmor is asked to do pathname relative to chroot jail
and __d_path() tells it it's not in that jail, the sucker just calls
d_absolute_path() instead. That's the other remaining caller of __d_path(),
BTW.
* seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
the normal seq_file logic will take care of growing the buffer and redoing
the call of ->show() just fine). However, if it gets a path not reachable
from root, it returns SEQ_SKIP. The only caller is adjusted (i.e. stopped
ignoring the return value as it used to do).
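The "return NULL on a path outside of root, never hand back the stopping point" contract from the list above can be illustrated with a toy tree walk. This is a sketch under assumptions, not the kernel code: `struct toy_node` and `toy_find_root()` are hypothetical, and a single parent pointer stands in for the vfsmount/dentry pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy tree node; stands in for a (vfsmount, dentry) pair. */
struct toy_node {
	const struct toy_node *parent;  /* NULL at the top of the tree */
	const char *name;
};

/*
 * Walk from @node towards the top. Return @root if it is reached, or NULL
 * if the walk ends somewhere else. The stopping point is never written
 * back to the caller, and @root is only compared by address, never
 * dereferenced.
 */
static const struct toy_node *toy_find_root(const struct toy_node *node,
					    const struct toy_node *root)
{
	assert(node != NULL);
	for (; node != NULL; node = node->parent) {
		if (node == root)
			return root;
	}
	return NULL;
}
```

Because the caller only ever receives the root it passed in or NULL, it has no pointer to a possibly freed object that it could be tempted to dereference.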
Reviewed-by: John Johansen <john.johansen@canonical.com>
ACKed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
2011-12-05 20:43:34 +07:00
|
|
|
extern bool our_mnt(struct vfsmount *mnt);
|
2017-04-12 17:24:28 +07:00
|
|
|
extern __printf(2, 3)
|
|
|
|
int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
|
|
|
|
extern int super_setup_bdi(struct super_block *sb);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-03-30 06:08:22 +07:00
|
|
|
extern int current_umask(void);
|
|
|
|
|
2012-08-28 21:50:40 +07:00
|
|
|
extern void ihold(struct inode * inode);
|
|
|
|
extern void iput(struct inode *);
|
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
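To make the y2038 limitation concrete, here is a small userspace sketch. It assumes a 32-bit seconds field, as struct timespec has on 32-bit architectures; `struct toy_timespec32`, `struct toy_timespec64`, `toy_to_ts64()` and `toy_to_ts32()` are hypothetical names and are not the kernel's conversion helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models of the two layouts; not the kernel definitions. */
struct toy_timespec32 { int32_t tv_sec; long tv_nsec; };
struct toy_timespec64 { int64_t tv_sec; long tv_nsec; };

/* Widening is always lossless. */
static struct toy_timespec64 toy_to_ts64(struct toy_timespec32 ts)
{
	struct toy_timespec64 out = { ts.tv_sec, ts.tv_nsec };

	return out;
}

/*
 * Narrowing can fail: returns false when the seconds value does not fit
 * in 32 bits, which first happens one second after 2038-01-19 03:14:07 UTC.
 */
static bool toy_to_ts32(struct toy_timespec64 ts, struct toy_timespec32 *out)
{
	assert(out != NULL);
	if (ts.tv_sec > INT32_MAX || ts.tv_sec < INT32_MIN)
		return false;
	out->tv_sec = (int32_t)ts.tv_sec;
	out->tv_nsec = ts.tv_nsec;
	return true;
}
```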
The change was made with the help of the following coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime1, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime1, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
\( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
\( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
\( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 09:36:02 +07:00
|
|
|
extern int generic_update_time(struct inode *, struct timespec64 *, int);
|
2012-08-28 21:50:40 +07:00
|
|
|
|
2006-01-17 13:14:23 +07:00
|
|
|
/* /sys/fs */
|
2007-10-30 03:17:23 +07:00
|
|
|
extern struct kobject *fs_kobj;
|
2006-01-17 13:14:23 +07:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
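The first two substitutions in the list above are safe because the shift amount is zero once both constants are equal. A minimal sketch of that equivalence follows; the `TOY_` macros and both functions are hypothetical, and the value 12 (4 KiB pages) is an assumption chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel constants; 12 means 4 KiB pages. */
#define TOY_PAGE_SHIFT        12
#define TOY_PAGE_CACHE_SHIFT  TOY_PAGE_SHIFT

/* Old style: convert a page offset into a page-cache index. */
static unsigned long toy_index_old(unsigned long pgoff)
{
	assert(TOY_PAGE_CACHE_SHIFT >= TOY_PAGE_SHIFT);
	return pgoff >> (TOY_PAGE_CACHE_SHIFT - TOY_PAGE_SHIFT);
}

/* New style: with equal shifts the conversion is the identity. */
static unsigned long toy_index_new(unsigned long pgoff)
{
	return pgoff;
}
```

With the two shifts equal, `toy_index_old()` shifts by zero bits, so both functions return their argument unchanged.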
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 19:29:47 +07:00
|
|
|
#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
|
2008-08-06 20:12:22 +07:00
|
|
|
|
2015-11-16 21:49:34 +07:00
|
|
|
#ifdef CONFIG_MANDATORY_FILE_LOCKING
|
2014-03-10 20:54:15 +07:00
|
|
|
extern int locks_mandatory_locked(struct file *);
|
2015-12-03 18:59:49 +07:00
|
|
|
extern int locks_mandatory_area(struct inode *, struct file *, loff_t, loff_t, unsigned char);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Candidates for mandatory locking have the setgid bit set
|
|
|
|
* but no group execute bit - an otherwise meaningless combination.
|
|
|
|
*/
|
2007-10-02 04:41:11 +07:00
|
|
|
|
|
|
|
static inline int __mandatory_lock(struct inode *ino)
|
|
|
|
{
|
|
|
|
return (ino->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-07-17 14:45:35 +07:00
|
|
|
* ... and these candidates should be on SB_MANDLOCK mounted fs,
|
2007-10-02 04:41:11 +07:00
|
|
|
* otherwise these will be advisory locks
|
|
|
|
*/
|
|
|
|
|
|
|
|
static inline int mandatory_lock(struct inode *ino)
|
|
|
|
{
|
|
|
|
return IS_MANDLOCK(ino) && __mandatory_lock(ino);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-03-10 20:54:15 +07:00
|
|
|
static inline int locks_verify_locked(struct file *file)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2016-09-16 17:44:20 +07:00
|
|
|
if (mandatory_lock(locks_inode(file)))
|
2014-03-10 20:54:15 +07:00
|
|
|
return locks_mandatory_locked(file);
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int locks_verify_truncate(struct inode *inode,
|
2015-12-03 18:59:49 +07:00
|
|
|
struct file *f,
|
2005-04-17 05:20:36 +07:00
|
|
|
loff_t size)
|
|
|
|
{
|
2015-12-03 18:59:49 +07:00
|
|
|
if (!inode->i_flctx || !mandatory_lock(inode))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (size < inode->i_size) {
|
|
|
|
return locks_mandatory_area(inode, f, size, inode->i_size - 1,
|
|
|
|
F_WRLCK);
|
|
|
|
} else {
|
|
|
|
return locks_mandatory_area(inode, f, inode->i_size, size - 1,
|
|
|
|
F_WRLCK);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2015-11-16 21:49:34 +07:00
|
|
|
#else /* !CONFIG_MANDATORY_FILE_LOCKING */
|
|
|
|
|
|
|
|
static inline int locks_mandatory_locked(struct file *file)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-01-13 07:30:34 +07:00
|
|
|
static inline int locks_mandatory_area(struct inode *inode, struct file *filp,
|
|
|
|
loff_t start, loff_t end, unsigned char type)
|
2015-11-16 21:49:34 +07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int __mandatory_lock(struct inode *inode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int mandatory_lock(struct inode *inode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int locks_verify_locked(struct file *file)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int locks_verify_truncate(struct inode *inode, struct file *filp,
|
|
|
|
loff_t size)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_MANDATORY_FILE_LOCKING */
|
|
|
|
|
|
|
|
|
|
|
|
#ifdef CONFIG_FILE_LOCKING
|
2005-04-17 05:20:36 +07:00
|
|
|
static inline int break_lease(struct inode *inode, unsigned int mode)
|
|
|
|
{
|
2014-02-04 00:13:06 +07:00
|
|
|
/*
|
|
|
|
* Since this check is lockless, we must ensure that any refcounts
|
2015-01-22 08:44:01 +07:00
|
|
|
* taken are done before checking i_flctx->flc_lease. Otherwise, we
|
|
|
|
* could end up racing with tasks trying to set a new lease on this
|
|
|
|
* file.
|
2014-02-04 00:13:06 +07:00
|
|
|
*/
|
|
|
|
smp_mb();
|
2015-01-17 03:05:55 +07:00
|
|
|
if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
|
2012-03-06 01:18:59 +07:00
|
|
|
return __break_lease(inode, mode, FL_LEASE);
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2012-03-06 01:18:59 +07:00
|
|
|
|
|
|
|
static inline int break_deleg(struct inode *inode, unsigned int mode)
|
|
|
|
{
|
2014-06-10 23:24:40 +07:00
|
|
|
/*
|
|
|
|
* Since this check is lockless, we must ensure that any refcounts
|
2015-01-22 08:44:01 +07:00
|
|
|
* taken are done before checking i_flctx->flc_lease. Otherwise, we
|
|
|
|
* could end up racing with tasks trying to set a new lease on this
|
|
|
|
* file.
|
2014-06-10 23:24:40 +07:00
|
|
|
*/
|
|
|
|
smp_mb();
|
2015-01-17 03:05:55 +07:00
|
|
|
if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
|
2012-03-06 01:18:59 +07:00
|
|
|
return __break_lease(inode, mode, FL_DELEG);
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2012-03-06 01:18:59 +07:00
|
|
|
|
2012-08-28 21:50:40 +07:00
|
|
|
static inline int try_break_deleg(struct inode *inode, struct inode **delegated_inode)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = break_deleg(inode, O_WRONLY|O_NONBLOCK);
|
|
|
|
if (ret == -EWOULDBLOCK && delegated_inode) {
|
|
|
|
*delegated_inode = inode;
|
|
|
|
ihold(inode);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int break_deleg_wait(struct inode **delegated_inode)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = break_deleg(*delegated_inode, O_WRONLY);
|
|
|
|
iput(*delegated_inode);
|
|
|
|
*delegated_inode = NULL;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2015-01-22 01:17:03 +07:00
|
|
|
static inline int break_layout(struct inode *inode, bool wait)
|
|
|
|
{
|
|
|
|
smp_mb();
|
|
|
|
if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
|
|
|
|
return __break_lease(inode,
|
|
|
|
wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
|
|
|
|
FL_LAYOUT);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-08-06 20:12:22 +07:00
|
|
|
#else /* !CONFIG_FILE_LOCKING */
|
2009-01-20 17:29:46 +07:00
|
|
|
static inline int break_lease(struct inode *inode, unsigned int mode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-03-06 01:18:59 +07:00
|
|
|
static inline int break_deleg(struct inode *inode, unsigned int mode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2012-08-28 21:50:40 +07:00
|
|
|
|
|
|
|
static inline int try_break_deleg(struct inode *inode, struct inode **delegated_inode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int break_deleg_wait(struct inode **delegated_inode)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-01-22 01:17:03 +07:00
|
|
|
static inline int break_layout(struct inode *inode, bool wait)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-08-06 20:12:22 +07:00
|
|
|
#endif /* CONFIG_FILE_LOCKING */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* fs/open.c */
|
2012-10-11 03:43:13 +07:00
|
|
|
struct audit_names;
|
2012-10-11 02:25:28 +07:00
|
|
|
struct filename {
|
2012-10-11 03:43:13 +07:00
|
|
|
const char *name; /* pointer to actual string */
|
|
|
|
const __user char *uptr; /* original userland pointer */
|
2015-01-22 12:00:23 +07:00
|
|
|
int refcnt;
|
2018-03-01 06:19:21 +07:00
|
|
|
struct audit_names *aname;
|
2015-02-23 08:07:13 +07:00
|
|
|
const char iname[];
|
2012-10-11 02:25:28 +07:00
|
|
|
};
|
2019-03-08 07:27:07 +07:00
|
|
|
static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-03-26 01:24:09 +07:00
|
|
|
extern long vfs_truncate(const struct path *, loff_t);
|
2006-01-08 16:02:39 +07:00
|
|
|
extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
|
|
|
|
struct file *filp);
|
2014-11-08 02:44:25 +07:00
|
|
|
extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
|
2009-06-20 01:28:07 +07:00
|
|
|
loff_t len);
|
2007-10-20 08:16:18 +07:00
|
|
|
extern long do_sys_open(int dfd, const char __user *filename, int flags,
|
2011-11-22 02:59:34 +07:00
|
|
|
umode_t mode);
|
2012-10-11 03:43:10 +07:00
|
|
|
extern struct file *file_open_name(struct filename *, int, umode_t);
|
2011-11-22 02:59:34 +07:00
|
|
|
extern struct file *filp_open(const char *, int, umode_t);
|
2011-03-12 00:08:24 +07:00
|
|
|
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
|
2016-03-23 04:25:36 +07:00
|
|
|
const char *, int, umode_t);
|
2012-06-27 00:58:53 +07:00
|
|
|
extern struct file * dentry_open(const struct path *, int, const struct cred *);
|
2018-07-12 22:18:42 +07:00
|
|
|
extern struct file * open_with_fake_path(const struct path *, int,
|
|
|
|
struct inode*, const struct cred *);
|
2018-06-08 22:19:32 +07:00
|
|
|
static inline struct file *file_clone_open(struct file *file)
|
|
|
|
{
|
|
|
|
return dentry_open(&file->f_path, file->f_flags, file->f_cred);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int filp_close(struct file *, fl_owner_t id);
|
2012-10-11 02:25:28 +07:00
|
|
|
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
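The AT_EMPTY_PATH case described above can be sketched from userspace roughly as follows. This is a minimal sketch, not the glibc fexecve() implementation: `exec_by_fd` is a hypothetical helper name, the raw `syscall()` form is used on the assumption that the C library may not ship an execveat() wrapper, and the fallback value for AT_EMPTY_PATH is only defined if the system headers do not provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>        /* AT_EMPTY_PATH (with _GNU_SOURCE) */
#include <sys/syscall.h>  /* SYS_execveat */
#include <unistd.h>       /* syscall() */

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000  /* fallback if the headers did not expose it */
#endif

/*
 * exec_by_fd() is a hypothetical helper, not a libc function.
 * It asks the kernel to execute the file that fd refers to by passing an
 * empty pathname together with AT_EMPTY_PATH. On success it does not
 * return; on failure it returns -1 with errno set.
 */
static int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}
```

A caller would typically open the executable first (for example with O_RDONLY or O_PATH) and then call `exec_by_fd(fd, argv, envp)`. As the text notes, a script opened with O_CLOEXEC would not be reachable by its interpreter after the exec.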
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 07:57:29 +07:00
|
|
|
extern struct filename *getname_flags(const char __user *, int, int *);
|
2012-10-11 02:25:28 +07:00
|
|
|
extern struct filename *getname(const char __user *);
|
2014-02-06 03:54:53 +07:00
|
|
|
extern struct filename *getname_kernel(const char *);
|
2015-01-22 12:00:23 +07:00
|
|
|
extern void putname(struct filename *name);
|
2012-10-11 02:25:28 +07:00
|
|
|
|
2012-06-22 15:40:19 +07:00
|
|
|
extern int finish_open(struct file *file, struct dentry *dentry,
|
2018-06-08 22:44:56 +07:00
|
|
|
int (*open)(struct inode *, struct file *));
|
2012-06-10 17:48:09 +07:00
|
|
|
extern int finish_no_open(struct file *file, struct dentry *dentry);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* fs/dcache.c */
|
|
|
|
extern void __init vfs_caches_init_early(void);
|
2015-08-07 05:46:20 +07:00
|
|
|
extern void __init vfs_caches_init(void);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-12-07 11:32:57 +07:00
|
|
|
extern struct kmem_cache *names_cachep;
|
|
|
|
|
2012-10-11 02:25:26 +07:00
|
|
|
#define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL)
|
2009-05-16 16:22:14 +07:00
|
|
|
#define __putname(name) kmem_cache_free(names_cachep, (void *)(name))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-05-23 04:13:33 +07:00
|
|
|
extern struct super_block *blockdev_superblock;
|
|
|
|
static inline bool sb_is_blkdev_sb(struct super_block *sb)
|
|
|
|
{
|
2020-06-20 14:16:40 +07:00
|
|
|
return IS_ENABLED(CONFIG_BLOCK) && sb == blockdev_superblock;
|
2009-04-01 18:07:16 +07:00
|
|
|
}
|
2012-07-03 21:45:31 +07:00
|
|
|
|
2020-06-20 14:16:35 +07:00
|
|
|
void emergency_thaw_all(void);
|
2009-04-27 21:43:53 +07:00
|
|
|
extern int sync_filesystem(struct super_block *);
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
extern const struct file_operations def_blk_fops;
|
2006-03-28 16:56:42 +07:00
|
|
|
extern const struct file_operations def_chr_fops;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* fs/char_dev.c */
|
2017-06-16 03:05:21 +07:00
|
|
|
#define CHRDEV_MAJOR_MAX 512
|
2016-02-19 21:36:07 +07:00
|
|
|
/* Marks the bottom of the first segment of free char majors */
|
|
|
|
#define CHRDEV_MAJOR_DYN_END 234
|
2017-06-16 03:05:20 +07:00
|
|
|
/* Marks the top and bottom of the second segment of free char majors */
|
|
|
|
#define CHRDEV_MAJOR_DYN_EXT_START 511
|
|
|
|
#define CHRDEV_MAJOR_DYN_EXT_END 384
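The two segments of dynamically assignable char majors can be illustrated with a small standalone check. This is only a sketch: `major_in_dynamic_range` is a hypothetical helper, and the top of the first segment (254) is an assumption taken from outside this header, which only states the bottom of that segment.

```c
#include <stdbool.h>

#define CHRDEV_MAJOR_DYN_END        234  /* bottom of the first segment */
#define CHRDEV_MAJOR_DYN_EXT_START  511  /* top of the second segment */
#define CHRDEV_MAJOR_DYN_EXT_END    384  /* bottom of the second segment */

/* Assumption: top of the first segment; this header does not state it. */
#define FIRST_SEGMENT_TOP           254

/* Hypothetical helper: true if major lies in either free segment. */
static bool major_in_dynamic_range(unsigned int major)
{
	if (major >= CHRDEV_MAJOR_DYN_END && major <= FIRST_SEGMENT_TOP)
		return true;
	return major >= CHRDEV_MAJOR_DYN_EXT_END &&
	       major <= CHRDEV_MAJOR_DYN_EXT_START;
}
```

Note that the EXT_START value is larger than EXT_END because the segment is described from its top down to its bottom.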
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, const char *);
|
|
|
|
extern int register_chrdev_region(dev_t, unsigned, const char *);
|
2009-08-06 16:13:23 +07:00
|
|
|
extern int __register_chrdev(unsigned int major, unsigned int baseminor,
|
|
|
|
unsigned int count, const char *name,
|
|
|
|
const struct file_operations *fops);
|
|
|
|
extern void __unregister_chrdev(unsigned int major, unsigned int baseminor,
|
|
|
|
unsigned int count, const char *name);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void unregister_chrdev_region(dev_t, unsigned);
|
2006-03-31 17:30:32 +07:00
|
|
|
extern void chrdev_show(struct seq_file *,off_t);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-08-06 16:13:23 +07:00
|
|
|
static inline int register_chrdev(unsigned int major, const char *name,
|
|
|
|
const struct file_operations *fops)
|
|
|
|
{
|
|
|
|
return __register_chrdev(major, 0, 256, name, fops);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unregister_chrdev(unsigned int major, const char *name)
|
|
|
|
{
|
|
|
|
__unregister_chrdev(major, 0, 256, name);
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void init_special_inode(struct inode *, umode_t, dev_t);
|
|
|
|
|
|
|
|
/* Invalid inode operations -- fs/bad_inode.c */
|
|
|
|
extern void make_bad_inode(struct inode *);
|
2015-11-19 20:00:11 +07:00
|
|
|
extern bool is_bad_inode(struct inode *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
unsigned long invalidate_mapping_pages(struct address_space *mapping,
|
|
|
|
pgoff_t start, pgoff_t end);
|
2007-02-10 16:45:38 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
static inline void invalidate_remote_inode(struct inode *inode)
|
|
|
|
{
|
|
|
|
if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
|
|
|
|
S_ISLNK(inode->i_mode))
|
2007-02-10 16:45:39 +07:00
|
|
|
invalidate_mapping_pages(inode->i_mapping, 0, -1);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
extern int invalidate_inode_pages2(struct address_space *mapping);
|
|
|
|
extern int invalidate_inode_pages2_range(struct address_space *mapping,
|
|
|
|
pgoff_t start, pgoff_t end);
|
|
|
|
extern int write_inode_now(struct inode *, int);
|
|
|
|
extern int filemap_fdatawrite(struct address_space *);
|
|
|
|
extern int filemap_flush(struct address_space *);
|
2017-07-06 18:02:22 +07:00
|
|
|
extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
|
2009-08-18 00:30:27 +07:00
|
|
|
extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
|
|
|
|
loff_t lend);
|
2019-06-21 04:05:37 +07:00
|
|
|
extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte);
|
2017-07-31 21:29:38 +07:00
|
|
|
|
|
|
|
static inline int filemap_fdatawait(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return filemap_fdatawait_range(mapping, 0, LLONG_MAX);
|
|
|
|
}
|
|
|
|
|
2017-06-20 19:05:41 +07:00
|
|
|
extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
|
|
|
|
loff_t lend);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int filemap_write_and_wait_range(struct address_space *mapping,
|
|
|
|
loff_t lstart, loff_t lend);
|
[PATCH] fadvise(): write commands
Add two new Linux-specific fadvise() extensions:
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent the last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
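The inclusive-end convention described in the paragraph above can be sketched as a small conversion from an (offset, len) pair. This is an illustration only: `inclusive_end` is a hypothetical helper, and treating a zero length, a negative argument, or an overflowing sum as "to end of file" is an assumption of this sketch, not a statement about what sys_fadvise64_64() does in every case.

```c
#include <limits.h>

/*
 * Hypothetical helper: convert (offset, len) into the offset of the last
 * affected byte. Returns -1 to mean "through end of file".
 * Assumption of this sketch: len == 0, negative inputs, and a sum that
 * would overflow are all treated as "through end of file".
 */
static long long inclusive_end(long long offset, long long len)
{
	if (offset < 0 || len <= 0 || len > LLONG_MAX - offset)
		return -1;
	return offset + len - 1;  /* inclusive: last byte, not one past it */
}
```

For instance, a request covering 4096 bytes from offset 0 ends at byte 4095, which is why wrappers in this header that mean "the whole file" pass an end of LLONG_MAX rather than a length.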
As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".
But these commands are a perfect fit with the fadvise() implementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 18:18:04 +07:00
|
|
|
extern int __filemap_fdatawrite_range(struct address_space *mapping,
|
|
|
|
loff_t start, loff_t end, int sync_mode);
|
2008-07-12 06:27:31 +07:00
|
|
|
extern int filemap_fdatawrite_range(struct address_space *mapping,
|
|
|
|
loff_t start, loff_t end);
|
2016-07-29 19:10:57 +07:00
|
|
|
extern int filemap_check_errors(struct address_space *mapping);
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_errors infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 18:02:25 +07:00
|
|
|
extern void __filemap_set_wb_err(struct address_space *mapping, int err);
|
2017-07-28 18:24:43 +07:00
|
|
|
|
2020-01-31 13:12:07 +07:00
|
|
|
static inline int filemap_write_and_wait(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
|
|
|
|
}
|
|
|
|
|
2017-07-28 18:24:43 +07:00
|
|
|
extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
|
|
|
|
loff_t lend);
|
2017-07-06 18:02:25 +07:00
|
|
|
extern int __must_check file_check_and_advance_wb_err(struct file *file);
|
|
|
|
extern int __must_check file_write_and_wait_range(struct file *file,
|
|
|
|
loff_t start, loff_t end);
|
|
|
|
|
2017-07-28 18:24:43 +07:00
|
|
|
static inline int file_write_and_wait(struct file *file)
|
|
|
|
{
|
|
|
|
return file_write_and_wait_range(file, 0, LLONG_MAX);
|
|
|
|
}
|
|
|
|
|
2017-07-06 18:02:25 +07:00
|
|
|
/**
|
|
|
|
* filemap_set_wb_err - set a writeback error on an address_space
|
|
|
|
* @mapping: mapping in which to set writeback error
|
|
|
|
* @err: error to be set in mapping
|
|
|
|
*
|
|
|
|
* When writeback fails in some way, we must record that error so that
|
|
|
|
* userspace can be informed when fsync and the like are called. We endeavor
|
|
|
|
* to report errors on any file that was open at the time of the error. Some
|
|
|
|
* internal callers also need to know when writeback errors have occurred.
|
|
|
|
*
|
|
|
|
* When a writeback error occurs, most filesystems will want to call
|
|
|
|
* filemap_set_wb_err to record the error in the mapping so that it will be
|
|
|
|
* automatically reported whenever fsync is called on the file.
|
|
|
|
*/
|
|
|
|
static inline void filemap_set_wb_err(struct address_space *mapping, int err)
|
|
|
|
{
|
|
|
|
/* Fastpath for common case of no error */
|
|
|
|
if (unlikely(err))
|
|
|
|
__filemap_set_wb_err(mapping, err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2020-07-09 19:49:30 +07:00
|
|
|
* filemap_check_wb_err - has an error occurred since the mark was sampled?
|
2017-07-06 18:02:25 +07:00
|
|
|
* @mapping: mapping to check for writeback errors
|
|
|
|
* @since: previously-sampled errseq_t
|
|
|
|
*
|
|
|
|
* Grab the errseq_t value from the mapping, and see if it has changed "since"
|
|
|
|
* the given value was sampled.
|
|
|
|
*
|
|
|
|
* If it has then report the latest error set, otherwise return 0.
|
|
|
|
*/
|
|
|
|
static inline int filemap_check_wb_err(struct address_space *mapping,
|
|
|
|
errseq_t since)
|
|
|
|
{
|
|
|
|
return errseq_check(&mapping->wb_err, since);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* filemap_sample_wb_err - sample the current errseq_t to test for later errors
|
|
|
|
* @mapping: mapping to be sampled
|
|
|
|
*
|
|
|
|
* Writeback errors are always reported relative to a particular sample point
|
|
|
|
* in the past. This function provides those sample points.
|
|
|
|
*/
|
|
|
|
static inline errseq_t filemap_sample_wb_err(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return errseq_sample(&mapping->wb_err);
|
|
|
|
}
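The sample/check pattern used by filemap_sample_wb_err() and filemap_check_wb_err() can be modelled in userspace as below. This is a deliberately simplified model under stated assumptions: it is not the kernel's errseq_t encoding, and every `model_*` and `wb_*` name is hypothetical. It only shows why each opener, holding its own cursor, sees a given writeback error once.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified model; NOT the kernel's errseq_t layout or algorithm. */
struct wb_err_model {
	uint32_t seq;  /* bumped each time an error is recorded */
	int      err;  /* most recent negative errno, 0 if none yet */
};

typedef uint32_t wb_cursor_t;  /* per-open-file "since" value */

/* Record a writeback error in the mapping-level state. */
static void model_set_wb_err(struct wb_err_model *m, int err)
{
	if (err) {
		m->err = err;
		m->seq++;
	}
}

/* Take a sample point, as done when a file is opened. */
static wb_cursor_t model_sample_wb_err(const struct wb_err_model *m)
{
	return m->seq;
}

/* Report the latest error if anything was recorded since the sample. */
static int model_check_wb_err(const struct wb_err_model *m, wb_cursor_t since)
{
	return m->seq != since ? m->err : 0;
}

/* What an fsync-like caller does: check, then move its cursor forward. */
static int model_check_and_advance(const struct wb_err_model *m,
				   wb_cursor_t *cursor)
{
	int err = model_check_wb_err(m, *cursor);

	*cursor = m->seq;
	return err;
}
```

With two cursors sampled before an error is recorded, each one reports the error on its first check and 0 on the next, which is the "no false report of success" property the commit message describes.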
|
|
|
|
|
vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
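The application-side pattern suggested above (one syncfs call, then fsync per descriptor only on failure) might look roughly like this. It is a sketch under assumptions: `first_failed_fd` is a hypothetical helper, all descriptors are assumed to live on the same filesystem, and the raw `syscall()` form is used only to keep the example independent of feature-test macros.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_syncfs */
#include <unistd.h>       /* syscall(), fsync() */

/*
 * first_failed_fd() is a hypothetical helper, not a kernel or libc API.
 * Assumption: every descriptor in fds[] is on the same filesystem.
 * Flush that filesystem with a single syncfs call; only if it reports an
 * error, fsync() each descriptor to find out which one failed.
 * Returns the index of the first failing descriptor, or -1 if none.
 */
static int first_failed_fd(const int *fds, size_t n)
{
	if (n == 0)
		return -1;
	if (syscall(SYS_syncfs, fds[0]) == 0)
		return -1;  /* no writeback error reported */
	for (size_t i = 0; i < n; i++)
		if (fsync(fds[i]) != 0)
			return (int)i;
	return -1;  /* the error belongs to a file not in fds[] */
}
```

The design choice is to pay for per-file fsync() calls only in the rare error case. On kernels without the per-superblock tracking described here, the initial call may return 0 even though an inode failed writeback.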
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 11:45:36 +07:00
|
|
|
/**
|
|
|
|
* file_sample_sb_err - sample the current errseq_t to test for later errors
|
2020-06-23 14:09:03 +07:00
|
|
|
* @file: file pointer to be sampled
|
2020-06-02 11:45:36 +07:00
 *
 * Grab the most current superblock-level errseq_t value for the given
 * struct file.
 */
static inline errseq_t file_sample_sb_err(struct file *file)
{
	return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err);
}

2019-09-24 05:38:03 +07:00
static inline int filemap_nr_thps(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	return atomic_read(&mapping->nr_thps);
#else
	return 0;
#endif
}

static inline void filemap_nr_thps_inc(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	atomic_inc(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}

static inline void filemap_nr_thps_dec(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	atomic_dec(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}

2010-03-22 23:32:25 +07:00
extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
			   int datasync);
extern int vfs_fsync(struct file *file, int datasync);
2016-04-07 22:52:01 +07:00

2019-04-10 03:51:48 +07:00
extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
				unsigned int flags);

2016-04-07 22:52:01 +07:00
/*
 * Sync the bytes written if this was a synchronous write. Expect ki_pos
 * to already be updated for the write, and will return either the amount
 * of bytes passed in, or an error if syncing the file failed.
 */
static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb->ki_flags & IOCB_DSYNC) {
		int ret = vfs_fsync_range(iocb->ki_filp,
				iocb->ki_pos - count, iocb->ki_pos - 1,
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
	}

	return count;
2014-02-10 03:18:09 +07:00
}
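The arguments that generic_write_sync() passes to vfs_fsync_range() follow from ki_pos already pointing past the data just written. The sketch below reproduces only that arithmetic in userspace; the TOY_IOCB_* bit values and the toy_* names are assumptions made for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; the kernel's IOCB_* values are not reproduced here. */
#define TOY_IOCB_DSYNC	0x1u
#define TOY_IOCB_SYNC	0x2u

struct toy_sync_args {
	int64_t start;	/* first byte of the write */
	int64_t end;	/* last byte of the write, inclusive */
	int datasync;	/* 1: data-only sync, 0: data and metadata */
};

/* pos_after is the file position after the write, count the bytes written. */
static struct toy_sync_args toy_write_sync_args(int64_t pos_after,
						int64_t count,
						unsigned int flags)
{
	struct toy_sync_args a;

	assert(count > 0 && pos_after >= count);
	a.start = pos_after - count;
	a.end = pos_after - 1;
	a.datasync = (flags & TOY_IOCB_SYNC) ? 0 : 1;
	return a;
}
```

A 100-byte write that ends at position 4196 therefore syncs bytes 4096 through 4195, and only a write carrying the SYNC bit asks for a full (non-datasync) flush.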
2016-04-07 22:52:01 +07:00

2005-04-17 05:20:36 +07:00
extern void emergency_sync(void);
extern void emergency_remount(void);
2020-01-09 20:30:41 +07:00

[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
#ifdef CONFIG_BLOCK
2020-01-09 20:30:41 +07:00
extern int bmap(struct inode *inode, sector_t *block);
#else
static inline int bmap(struct inode *inode, sector_t *block)
{
	return -EINVAL;
}
2006-10-01 01:45:40 +07:00
#endif
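The #ifdef/#else pair around bmap() is the usual way to compile a feature out while keeping callers unchanged. A minimal userspace sketch of the same pattern follows; TOY_CONFIG_BLOCK, toy_bmap() and toy_lookup_block() are hypothetical names, and the switch is deliberately left undefined so that the stub branch is the one that gets built.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical build switch standing in for a Kconfig symbol. */
#ifdef TOY_CONFIG_BLOCK
/* Feature built in: the real implementation would live in another file. */
int toy_bmap(long *block);
#else
/* Feature compiled out: callers still build, but always get an error. */
static inline int toy_bmap(long *block)
{
	(void)block;	/* the stub never touches the caller's value */
	return -EINVAL;
}
#endif

/* A caller that needs no #ifdef of its own. */
static int toy_lookup_block(long *block)
{
	assert(block != NULL);
	return toy_bmap(block);
}
```

The design choice is that the conditional lives in one header rather than at every call site; callers only have to handle the error return.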
2020-01-09 20:30:41 +07:00

2011-09-21 04:19:26 +07:00
extern int notify_change(struct dentry *, struct iattr *, struct inode **);
2008-07-22 11:07:17 +07:00
extern int inode_permission(struct inode *, int);
2011-06-21 06:16:29 +07:00
extern int generic_permission(struct inode *, int);
2014-10-24 05:14:36 +07:00
extern int __check_sticky(struct inode *dir, struct inode *inode);
2005-04-17 05:20:36 +07:00

2008-07-31 18:41:58 +07:00
static inline bool execute_ok(struct inode *inode)
{
	return (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode);
}

2013-03-20 08:01:03 +07:00
static inline void file_start_write(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return;
	__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}

2013-05-04 05:11:23 +07:00
static inline bool file_start_write_trylock(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return true;
	return __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, false);
}

2013-03-20 08:01:03 +07:00
static inline void file_end_write(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return;
	__sb_end_write(file_inode(file)->i_sb, SB_FREEZE_WRITE);
}
2017-01-31 15:34:57 +07:00

2011-06-20 21:52:57 +07:00
/*
 * get_write_access() gets write permission for a file.
 * put_write_access() releases this write permission.
 * This is used for regular files.
 * We cannot support write (and maybe mmap read-write shared) accesses and
 * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
 * can have the following values:
 * 0: no writers, no VM_DENYWRITE mappings
 * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
 * > 0: (i_writecount) users are writing to the file.
 *
 * Normally we operate on that counter with atomic_{inc,dec} and it's safe
 * except for the cases where we don't hold i_writecount yet. Then we need to
 * use {get,deny}_write_access() - these functions check the sign and refuse
 * to do the change if sign is wrong.
 */
static inline int get_write_access(struct inode *inode)
{
	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
}
static inline int deny_write_access(struct file *file)
{
2013-01-24 05:07:38 +07:00
	struct inode *inode = file_inode(file);
2011-06-20 21:52:57 +07:00
	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
}
2005-04-17 05:20:36 +07:00
static inline void put_write_access(struct inode * inode)
{
	atomic_dec(&inode->i_writecount);
}
static inline void allow_write_access(struct file *file)
{
	if (file)
2013-01-24 05:07:38 +07:00
		atomic_inc(&file_inode(file)->i_writecount);
2005-04-17 05:20:36 +07:00
}
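The sign convention for i_writecount described in the comment above can be tried out in userspace with C11 atomics. This is a sketch under assumptions: struct toy_inode and all toy_* helpers are hypothetical, and the two compare-and-swap loops are one straightforward way to get "increment unless negative" and "decrement unless positive", not a copy of the kernel's atomic helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_inode {
	atomic_int writecount;	/* >0 writers, <0 deny-write holders, 0 neither */
};

/* Increment *v only while it is not negative; report whether we did. */
static bool toy_inc_unless_negative(atomic_int *v)
{
	int old = atomic_load(v);

	while (old >= 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Decrement *v only while it is not positive; report whether we did. */
static bool toy_dec_unless_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old <= 0) {
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			return true;
	}
	return false;
}

static int toy_get_write_access(struct toy_inode *inode)
{
	return toy_inc_unless_negative(&inode->writecount) ? 0 : -ETXTBSY;
}

static int toy_deny_write_access(struct toy_inode *inode)
{
	return toy_dec_unless_positive(&inode->writecount) ? 0 : -ETXTBSY;
}

/* Undo a successful toy_get_write_access(). */
static void toy_put_write_access(struct toy_inode *inode)
{
	atomic_fetch_sub(&inode->writecount, 1);
}

/* Undo a successful toy_deny_write_access(). */
static void toy_allow_write_access(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->writecount, 1);
}
```

While one side holds the counter, the other side's request fails with -ETXTBSY; once the holder releases, the counter returns to zero and either side may take it.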
2013-09-27 17:20:03 +07:00
static inline bool inode_is_open_for_write(const struct inode *inode)
{
	return atomic_read(&inode->i_writecount) > 0;
}

2019-06-07 21:24:38 +07:00
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
2010-11-02 21:11:37 +07:00
static inline void i_readcount_dec(struct inode *inode)
{
	BUG_ON(!atomic_read(&inode->i_readcount));
	atomic_dec(&inode->i_readcount);
}
static inline void i_readcount_inc(struct inode *inode)
{
	atomic_inc(&inode->i_readcount);
}
#else
static inline void i_readcount_dec(struct inode *inode)
{
	return;
}
static inline void i_readcount_inc(struct inode *inode)
{
	return;
}
#endif
2008-07-24 11:29:30 +07:00
extern int do_pipe_flags(int *, int);
2005-04-17 05:20:36 +07:00

2016-04-21 05:46:27 +07:00
#define __kernel_read_file_id(id) \
	id(UNKNOWN, unknown) \
	id(FIRMWARE, firmware) \
2016-08-03 04:04:28 +07:00
	id(FIRMWARE_PREALLOC_BUFFER, firmware) \
2020-01-15 23:35:48 +07:00
	id(FIRMWARE_EFI_EMBEDDED, firmware) \
2016-04-21 05:46:27 +07:00
	id(MODULE, kernel-module) \
	id(KEXEC_IMAGE, kexec-image) \
	id(KEXEC_INITRAMFS, kexec-initramfs) \
	id(POLICY, security-policy) \
2017-09-10 14:49:45 +07:00
	id(X509_CERTIFICATE, x509-certificate) \
2016-04-21 05:46:27 +07:00
	id(MAX_ID, )

#define __fid_enumify(ENUM, dummy) READING_ ## ENUM,
#define __fid_stringify(dummy, str) #str,

2016-01-24 22:07:32 +07:00
enum kernel_read_file_id {
2016-04-21 05:46:27 +07:00
	__kernel_read_file_id(__fid_enumify)
};

static const char * const kernel_read_file_str[] = {
	__kernel_read_file_id(__fid_stringify)
2016-01-24 22:07:32 +07:00
};

2016-04-22 02:53:29 +07:00
static inline const char *kernel_read_file_id_str(enum kernel_read_file_id id)
2016-04-21 05:46:27 +07:00
{
2017-03-10 07:16:57 +07:00
	if ((unsigned)id >= READING_MAX_ID)
2016-04-21 05:46:27 +07:00
		return kernel_read_file_str[READING_UNKNOWN];

	return kernel_read_file_str[id];
}
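The __kernel_read_file_id() list above is an X-macro: one list expanded twice, once into enumerators and once into strings, so the two can never drift apart. A cut-down userspace version is sketched below; the TOY_/toy_ names and the shortened list are assumptions for illustration only.

```c
#include <assert.h>

/* Single source of truth: (enumerator suffix, printable name) pairs. */
#define TOY_FILE_ID_LIST(id) \
	id(UNKNOWN, unknown) \
	id(FIRMWARE, firmware) \
	id(MODULE, kernel-module) \
	id(MAX_ID, )

#define TOY_ENUMIFY(ENUM, dummy) TOY_READING_ ## ENUM,
#define TOY_STRINGIFY(dummy, str) #str,

/* Expands to TOY_READING_UNKNOWN, TOY_READING_FIRMWARE, ... */
enum toy_file_id {
	TOY_FILE_ID_LIST(TOY_ENUMIFY)
};

/* Expands to "unknown", "firmware", "kernel-module", "" */
static const char * const toy_file_id_str[] = {
	TOY_FILE_ID_LIST(TOY_STRINGIFY)
};

/* Both expansions come from the same list, so the sizes must agree. */
static_assert(sizeof(toy_file_id_str) / sizeof(toy_file_id_str[0]) ==
	      TOY_READING_MAX_ID + 1, "one string per enumerator");

/* Map an id to its name, falling back to "unknown" when out of range. */
static const char *toy_file_id_name(enum toy_file_id id)
{
	if ((unsigned)id >= TOY_READING_MAX_ID)
		return toy_file_id_str[TOY_READING_UNKNOWN];

	return toy_file_id_str[id];
}
```

Adding an entry to the list is then a one-line change that updates the enum, the string table and the range check together.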

2016-01-24 22:07:32 +07:00
extern int kernel_read_file(struct file *, void **, loff_t *, loff_t,
			    enum kernel_read_file_id);
2017-09-13 09:45:33 +07:00
extern int kernel_read_file_from_path(const char *, void **, loff_t *, loff_t,
2015-11-20 00:39:22 +07:00
				      enum kernel_read_file_id);
firmware_loader: load files from the mount namespace of init
I have an experimental setup where almost every possible system
service (even early startup ones) runs in a separate namespace, using a
dedicated, minimal file system. In the process of minimizing the contents
of the file systems with regard to modules and firmware files, I
noticed that in my system, the firmware files are loaded from three
different mount namespaces, those of systemd-udevd, init and
systemd-networkd. The logic of the source namespace is not very clear,
it seems to depend on the driver, but the namespace of the current
process is used.
So, this patch tries to make things a bit clearer and changes firmware
loading so that files are read only from the mount namespace of init.
This may also improve security, though I think that using firmware files
as an attack vector could be too impractical anyway.
Later, it might make sense to make the mount namespace configurable,
for example with a new file in /proc/sys/kernel/firmware_config/. That
would allow a dedicated file system only for firmware files and those
need not be present anywhere else. This configurability would make
more sense if made also for kernel modules and /sbin/modprobe. Modules
are already loaded from init namespace (usermodehelper uses kthreadd
namespace) except when directly loaded by systemd-udevd.
Instead of using the mount namespace of the current process to load
firmware files, use the mount namespace of init process.
Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/
Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/
Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 19:58:38 +07:00
extern int kernel_read_file_from_path_initns(const char *, void **, loff_t *, loff_t,
					     enum kernel_read_file_id);
2016-02-01 20:36:21 +07:00
extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t,
				    enum kernel_read_file_id);
2017-09-01 22:39:13 +07:00
extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
2020-05-08 13:54:16 +07:00
ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos);
2017-09-01 22:39:14 +07:00
extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
2017-09-01 22:39:15 +07:00
extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
2005-04-17 05:20:36 +07:00
extern struct file * open_exec(const char *);

/* fs/dcache.c -- generic fs support functions */
2015-11-17 13:40:11 +07:00
extern bool is_subdir(struct dentry *, struct dentry *);
2016-11-15 04:14:35 +07:00
extern bool path_is_under(const struct path *, const struct path *);
2005-04-17 05:20:36 +07:00

2015-06-19 15:29:13 +07:00
extern char *file_path(struct file *, char *, int);

2005-04-17 05:20:36 +07:00
#include <linux/err.h>

/* needed for stackable file system support */
2012-12-18 06:59:39 +07:00
extern loff_t default_llseek(struct file *file, loff_t offset, int whence);
2005-04-17 05:20:36 +07:00

2012-12-18 06:59:39 +07:00
extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);
2005-04-17 05:20:36 +07:00

2009-08-08 00:38:25 +07:00
extern int inode_init_always(struct super_block *, struct inode *);
2005-04-17 05:20:36 +07:00
extern void inode_init_once(struct inode *);
2011-02-23 19:49:47 +07:00
extern void address_space_init_once(struct address_space *mapping);
2005-04-17 05:20:36 +07:00
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
2010-06-08 00:43:19 +07:00
extern int generic_delete_inode(struct inode *inode);
2012-02-13 07:43:17 +07:00
static inline int generic_drop_inode(struct inode *inode)
{
2020-04-30 21:41:37 +07:00
	return !inode->i_nlink || inode_unhashed(inode) ||
		(inode->i_state & I_DONTCACHE);
2012-02-13 07:43:17 +07:00
}
2020-04-30 21:41:37 +07:00
extern void d_mark_dontcache(struct inode *inode);
2005-04-17 05:20:36 +07:00

2005-07-13 15:10:44 +07:00
extern struct inode *ilookup5_nowait(struct super_block *sb,
		unsigned long hashval, int (*test)(struct inode *, void *),
		void *data);
2005-04-17 05:20:36 +07:00
extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *), void *data);
extern struct inode *ilookup(struct super_block *sb, unsigned long ino);

2018-05-17 15:53:05 +07:00
extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
		int (*test)(struct inode *, void *),
		int (*set)(struct inode *, void *),
		void *data);
2005-04-17 05:20:36 +07:00
extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
2015-02-02 12:37:01 +07:00
extern struct inode *find_inode_nowait(struct super_block *,
				       unsigned long,
				       int (*match)(struct inode *,
						    unsigned long, void *),
				       void *data);
2017-12-01 18:40:16 +07:00
extern struct inode *find_inode_rcu(struct super_block *, unsigned long,
				    int (*)(struct inode *, void *), void *);
extern struct inode *find_inode_by_ino_rcu(struct super_block *, unsigned long);
2008-12-30 13:48:21 +07:00
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
lockdep: Add helper function for dir vs file i_mutex annotation
Purely in-memory filesystems do not use the inode hash as the dcache
tells us if an entry already exists. As a result, they do not call
unlock_new_inode, and thus directory inodes do not get put into a
different lockdep class for i_sem.
We need the different lockdep classes, because the locking order for
i_mutex is different for directory inodes and regular inodes. Directory
inodes can do "readdir()", which takes i_mutex *before* possibly taking
mm->mmap_sem (due to a page fault while copying the directory entry to
user space).
In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
before accessing i_mutex.
The two cases can never happen for the same inode, so no real deadlock
can occur, but without the different lockdep classes, lockdep cannot
understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
can lead to false positives from lockdep like below:
find/645 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
but task is already holding lock:
(&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
vfs_readdir+0x5b/0xb4
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
[<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
[<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
[<ffffffff81111557>] mmap_region+0x258/0x432
[<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
[<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
[<ffffffff8100c858>] sys_mmap+0x22/0x24
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
-> #0 (&mm->mmap_sem){++++++}:
[<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff81109541>] might_fault+0x89/0xac
[<ffffffff81149cff>] filldir+0x6f/0xc7
[<ffffffff811586ea>] dcache_readdir+0x67/0x205
[<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
[<ffffffff8114a073>] sys_getdents+0x7e/0xd1
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
This patch moves the directory vs file lockdep annotation into a helper
function that can be called by in-memory filesystems and has hugetlbfs
call it.
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25 18:48:12 +07:00
#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern void lockdep_annotate_inode_mutex_key(struct inode *inode);
#else
static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
#endif
2005-04-17 05:20:36 +07:00
extern void unlock_new_inode(struct inode *);
2018-06-29 02:53:17 +07:00
extern void discard_new_inode(struct inode *);
2010-10-23 22:19:54 +07:00
extern unsigned int get_next_ino(void);
2017-08-19 08:08:25 +07:00
extern void evict_inodes(struct super_block *sb);
2005-04-17 05:20:36 +07:00

tmpfs: per-superblock i_ino support
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even having
as many as 150 duplicated inode numbers with differing file content.
This causes actual, tangible problems in production. For example, we have
complaints from those working on remote caches that their application is
reporting cache corruptions because it uses (device, inodenum) to
establish the identity of a particular cache object, but because it's not
unique any more, the application refuses to continue and reports cache
corruption. Even worse, sometimes applications may not even detect the
corruption but may continue anyway, causing phantom and hard to debug
behaviour.
In general, userspace applications expect that (device, inodenum) should
be enough to uniquely point to one inode, which seems fair enough. One
might also need to check the generation, but in this case:
1. That's not currently exposed to userspace
(ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
same inode number on one device.
In order to mitigate this, we take a two-pronged approach:
1. Moving inum generation from being global to per-sb for tmpfs. This
itself allows some reduction in i_ino churn. This works on both 64-
and 32- bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
64-bit ino_t only: we allow users to mount tmpfs with a new inode64
option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
You can see how this compares to previous related patches which didn't
implement this per-superblock:
- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/
This patch (of 2):
get_next_ino has a number of problems:
- It uses and returns a uint, which is susceptible to overflow
if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
means it's not that difficult to cause inode number wraparounds on a
single device, which can result in having multiple distinct inodes
with the same inode number.
This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount. This is implemented in the next commit.
For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 13:20:20 +07:00
/*
 * Userspace may rely on the inode number being non-zero. For example, glibc
 * simply ignores files with zero i_ino in unlink() and other places.
 *
 * As an additional complication, if userspace was compiled with
 * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
 * lower 32 bits, so we need to check that those aren't zero explicitly. With
 * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
 * better safe than sorry.
 */
static inline bool is_zero_ino(ino_t ino)
{
	return (u32)ino == 0;
}
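The truncation that is_zero_ino() guards against is easy to reproduce with fixed-width integers. The sketch below is a userspace illustration under assumptions: toy_is_zero_ino() mirrors the check above with uint64_t standing in for a 64-bit ino_t, while toy_next_ino() is a hypothetical allocator showing one way a counter could skip such values; it is not taken from the tmpfs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when a 32-bit reader would see inode number 0. */
static bool toy_is_zero_ino(uint64_t ino)
{
	return (uint32_t)ino == 0;
}

/* Hypothetical allocator: advance the counter, skipping unsafe values. */
static uint64_t toy_next_ino(uint64_t *counter)
{
	uint64_t ino;

	assert(counter != NULL);
	ino = ++*counter;
	if (toy_is_zero_ino(ino))
		ino = ++*counter;	/* the next value has low bits 0x1 */
	return ino;
}
```

The value 0x100000000 is non-zero as a 64-bit number but reads as 0 through a 32-bit field, which is why the check looks only at the low 32 bits.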

2005-04-17 05:20:36 +07:00
extern void __iget(struct inode * inode);
2008-02-07 15:15:27 +07:00
extern void iget_failed(struct inode *);
2012-05-03 19:48:02 +07:00
extern void clear_inode(struct inode *);
2009-08-08 00:38:29 +07:00
extern void __destroy_inode(struct inode *);
2011-07-26 16:36:34 +07:00
extern struct inode *new_inode_pseudo(struct super_block *sb);
extern struct inode *new_inode(struct super_block *sb);
2011-01-07 13:49:50 +07:00
extern void free_inode_nonrcu(struct inode *inode);
2006-10-18 00:50:36 +07:00
extern int should_remove_suid(struct dentry *);
2015-05-21 21:05:53 +07:00
extern int file_remove_privs(struct file *);
2005-04-17 05:20:36 +07:00

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
2010-10-23 18:15:32 +07:00
static inline void insert_inode_hash(struct inode *inode)
{
2005-04-17 05:20:36 +07:00
	__insert_inode_hash(inode, inode->i_ino);
}
2011-07-28 11:41:09 +07:00

extern void __remove_inode_hash(struct inode *);
static inline void remove_inode_hash(struct inode *inode)
{
2015-03-12 19:19:11 +07:00
	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
2011-07-28 11:41:09 +07:00
		__remove_inode_hash(inode);
}

2010-10-23 18:15:32 +07:00
extern void inode_sb_list_add(struct inode *inode);
2005-04-17 05:20:36 +07:00

extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);

extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
2015-04-09 23:55:47 +07:00
extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
2018-10-30 06:40:31 +07:00
extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
				struct file *file_out, loff_t pos_out,
2018-10-30 06:41:49 +07:00
				loff_t *count, unsigned int remap_flags);
2019-06-05 22:04:48 +07:00
extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
2019-06-05 22:04:49 +07:00
extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
				    struct file *file_out, loff_t pos_out,
				    size_t *count, unsigned int flags);
2019-08-31 00:09:24 +07:00
extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
		struct iov_iter *to, ssize_t already_read);
2014-03-06 10:53:04 +07:00
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
2014-04-03 14:17:43 +07:00
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
2016-04-07 22:51:56 +07:00
extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
2014-02-12 09:34:08 +07:00
extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
2006-04-11 18:59:36 +07:00

2017-05-27 15:16:51 +07:00
ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
2017-07-06 23:58:37 +07:00
		rwf_t flags);
2017-05-27 15:16:52 +07:00
ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos,
2017-07-06 23:58:37 +07:00
		rwf_t flags);
2019-11-20 16:45:25 +07:00
ssize_t vfs_iocb_iter_read(struct file *file, struct kiocb *iocb,
			   struct iov_iter *iter);
ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
			    struct iov_iter *iter);
2015-01-26 03:11:59 +07:00

2009-08-20 22:43:41 +07:00
/* fs/block_dev.c */
2014-09-29 21:21:10 +07:00
extern ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to);
2014-04-03 14:21:50 +07:00
extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from);
2011-07-17 07:44:56 +07:00
extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
			int datasync);
2011-09-16 13:31:11 +07:00
extern void block_sync_page(struct page *page);
2009-08-20 22:43:41 +07:00

2006-04-11 18:59:36 +07:00
/* fs/splice.c */
2006-04-11 19:57:50 +07:00
extern ssize_t generic_file_splice_read(struct file *, loff_t *,
2006-04-11 18:59:36 +07:00
		struct pipe_inode_info *, size_t, unsigned int);
2014-04-05 15:27:08 +07:00
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
2006-04-11 19:57:50 +07:00
		struct file *, loff_t *, size_t, unsigned int);
2006-04-11 18:59:36 +07:00
extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
2006-04-11 19:57:50 +07:00
		struct file *out, loff_t *, size_t len, unsigned int flags);
2014-10-24 05:14:35 +07:00
extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
		loff_t *opos, size_t len, unsigned int flags);

2006-04-11 18:59:36 +07:00

2005-04-17 05:20:36 +07:00
extern void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
2012-12-18 06:59:39 +07:00
|
|
|
extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
|
|
|
|
extern loff_t no_llseek(struct file *file, loff_t offset, int whence);
|
2013-06-25 11:02:13 +07:00
|
|
|
extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
|
2012-12-18 06:59:39 +07:00
|
|
|
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
|
2011-09-16 06:06:50 +07:00
|
|
|
extern loff_t generic_file_llseek_size(struct file *file, loff_t offset,
|
2012-12-18 06:59:39 +07:00
|
|
|
int whence, loff_t maxsize, loff_t eof);
|
2013-06-16 23:27:42 +07:00
|
|
|
extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
|
|
|
|
int whence, loff_t size);
|
2015-12-06 10:04:48 +07:00
|
|
|
extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
|
|
|
|
extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int generic_file_open(struct inode * inode, struct file * filp);
|
|
|
|
extern int nonseekable_open(struct inode * inode, struct file * filp);
|
fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.
This caused a regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such a regression for the particular case of
/proc/xen/xenbus.
The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.
However, even though the 2006 version of Linus's patch added f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655e3 -
does so regardless of whether a file is seekable or not.
See
https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
https://lwn.net/Articles/180387
https://lwn.net/Articles/180396
for historic context.
The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:
kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...
Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on the passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write cannot proceed until the
read is done, and the read is waiting for some, potentially external, event
for a potentially unbounded time -> deadlock.
Besides xenbus, there are 14 such places in the kernel that I've found
with a semantic patch (see below):
drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:
See
https://github.com/libfuse/osspd
https://lwn.net/Articles/308445
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:
https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
Let's fix this regression. The plan is:
1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.
2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.
3. With a semantic patch, search for and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there are no other methods in file_operations
which assume @offset access.
4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
stream_open if that bit is present in filesystem open reply.
It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so if we made such a change it would break a real user.
5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).
This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.
This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.
Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.
The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-27 05:20:43 +07:00
|
|
|
extern int stream_open(struct inode * inode, struct file * filp);
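The difference between the two openers described in the commit message above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct model_file`, the `model_*` helpers and the numeric flag values are hypothetical stand-ins, not the kernel's definitions. The point it shows is that a stream-style open drops the "take the f_pos lock" bit in addition to the seek-related bits, while a plain non-seekable open keeps it.

```c
/* Illustrative flag values for this user-space model only; the kernel's
 * real FMODE_* constants are defined elsewhere and may differ. */
#define FMODE_LSEEK      0x01u
#define FMODE_PREAD      0x02u
#define FMODE_PWRITE     0x04u
#define FMODE_ATOMIC_POS 0x08u	/* read/write serialise on the f_pos lock */
#define FMODE_STREAM     0x10u	/* read/write never use or change ppos */

struct model_file {		/* hypothetical stand-in for struct file */
	unsigned int f_mode;
};

/* Non-seekable, but read/write may still depend on ppos: keep the lock. */
int model_nonseekable_open(struct model_file *filp)
{
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
	return 0;
}

/* Stream-like: also drop the f_pos lock so read and write can overlap. */
int model_stream_open(struct model_file *filp)
{
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
			  FMODE_ATOMIC_POS);
	filp->f_mode |= FMODE_STREAM;
	return 0;
}

/* Would read()/write() take the f_pos lock for this file? */
int model_takes_pos_lock(const struct model_file *filp)
{
	return (filp->f_mode & FMODE_ATOMIC_POS) != 0;
}
```

In this model a file opened with `model_nonseekable_open()` still reports that it takes the position lock, so a blocked read would hold off a write; one opened with `model_stream_open()` does not.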
|
2005-04-17 05:20:36 +07:00
|
|
|
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
#ifdef CONFIG_BLOCK
|
2016-06-06 02:31:50 +07:00
|
|
|
typedef void (dio_submit_t)(struct bio *bio, struct inode *inode,
|
2010-05-23 22:00:55 +07:00
|
|
|
loff_t file_offset);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
enum {
|
cleanup blockdev_direct_IO locking
Currently the locking in blockdev_direct_IO is a mess, we have three different
locking types and very confusing checks for some of them. The most
complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
used.
This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
The difference is that DIO_NO_LOCKING always sets the create argument for
the get_blocks callback to zero, but we can easily move that to the actual
get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode:
gfs already ignores the create argument and thus is fine with the new
version, ocfs2 only errors out if create were ever set, and we can remove
this dead code now, the block device code only ever uses create for an
error message if we are fully beyond the device which can never happen,
and last but not least XFS will need the new behaviour for writes.
Now we can replace the lock_type variable with a flags one, where no flag
means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
flag. Separate out the check for not allowing to fill holes into a separate
flag, although for now both flags always get set at the same time.
Also revamp the documentation of the locking scheme to actually make sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-11-03 22:44:53 +07:00
|
|
|
/* need locking between buffered and direct access */
|
|
|
|
DIO_LOCKING = 0x01,
|
|
|
|
|
|
|
|
/* filesystem does not support filling holes */
|
|
|
|
DIO_SKIP_HOLES = 0x02,
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2020-06-10 00:22:47 +07:00
|
|
|
void dio_end_io(struct bio *bio);
|
|
|
|
|
2015-03-16 18:33:50 +07:00
|
|
|
ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
|
|
|
|
struct block_device *bdev, struct iov_iter *iter,
|
2016-04-07 22:51:58 +07:00
|
|
|
get_block_t get_block,
|
2015-03-16 18:33:50 +07:00
|
|
|
dio_iodone_t end_io, dio_submit_t submit_io,
|
|
|
|
int flags);
|
fs: introduce new truncate sequence
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
the event of write failure after allocation, so this must be performed
in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-26 22:05:33 +07:00
|
|
|
|
2015-03-16 18:33:50 +07:00
|
|
|
static inline ssize_t blockdev_direct_IO(struct kiocb *iocb,
|
|
|
|
struct inode *inode,
|
2016-04-07 22:51:58 +07:00
|
|
|
struct iov_iter *iter,
|
2015-03-16 18:33:50 +07:00
|
|
|
get_block_t get_block)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-03-16 18:33:50 +07:00
|
|
|
return __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
|
2016-04-07 22:51:58 +07:00
|
|
|
get_block, NULL, NULL, DIO_LOCKING | DIO_SKIP_HOLES);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
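The flag scheme described in the locking cleanup commit (no flag means the old DIO_NO_LOCKING behaviour, with locking and hole handling as separate bits) can be illustrated with a short user-space sketch. The enum values are copied from the header above; the `model_*` helpers are hypothetical and only show how such a bitmask might be tested, not how the direct I/O code actually consumes it.

```c
enum {
	/* need locking between buffered and direct access */
	DIO_LOCKING	= 0x01,

	/* filesystem does not support filling holes */
	DIO_SKIP_HOLES	= 0x02,
};

/* Hypothetical helper: does this caller ask for buffered/direct locking? */
int model_dio_needs_locking(int flags)
{
	return (flags & DIO_LOCKING) != 0;
}

/* Hypothetical helper: may a direct write fill a hole for this caller? */
int model_dio_may_fill_hole(int flags)
{
	return !(flags & DIO_SKIP_HOLES);
}

/* The combination passed by the blockdev_direct_IO() wrapper above. */
int model_default_dio_flags(void)
{
	return DIO_LOCKING | DIO_SKIP_HOLES;
}
```

Keeping the two properties in separate bits lets a caller opt out of one without the other, even though the wrapper above always sets both.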
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
2006-10-01 01:45:40 +07:00
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-05-31 23:22:33 +07:00
|
|
|
void inode_dio_wait(struct inode *inode);
|
direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-16 06:05:48 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* inode_dio_begin - signal start of a direct I/O request
|
|
|
|
* @inode: inode the direct I/O happens on
|
|
|
|
*
|
|
|
|
* This is called before a direct I/O request is issued; the raised
|
|
|
|
* i_dio_count makes inode_dio_wait() callers wait until it completes.
|
|
|
|
*/
|
|
|
|
static inline void inode_dio_begin(struct inode *inode)
|
|
|
|
{
|
|
|
|
atomic_inc(&inode->i_dio_count);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* inode_dio_end - signal finish of a direct I/O request
|
|
|
|
* @inode: inode the direct I/O happens on
|
|
|
|
*
|
|
|
|
* This is called once we've finished processing a direct I/O request,
|
|
|
|
* and is used to wake up callers waiting for direct I/O to be quiesced.
|
|
|
|
*/
|
|
|
|
static inline void inode_dio_end(struct inode *inode)
|
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&inode->i_dio_count))
|
|
|
|
wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
|
|
|
|
}
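The begin/end pairing above is a plain in-flight counter with a wakeup when it drops to zero. A minimal user-space model using C11 atomics, assuming a hypothetical `struct model_inode` and a `wakeups` field that merely counts where the kernel would call wake_up_bit():

```c
#include <stdatomic.h>

struct model_inode {		/* hypothetical stand-in for struct inode */
	atomic_int i_dio_count;	/* direct I/O requests currently in flight */
	int wakeups;		/* counts simulated wake_up_bit() calls */
};

void model_inode_init(struct model_inode *inode)
{
	atomic_init(&inode->i_dio_count, 0);
	inode->wakeups = 0;
}

/* Mark one more direct I/O request as in flight. */
void model_dio_begin(struct model_inode *inode)
{
	atomic_fetch_add(&inode->i_dio_count, 1);
}

/* Mark one request as finished; wake waiters when the last one completes. */
void model_dio_end(struct model_inode *inode)
{
	/* atomic_fetch_sub() returns the previous value, so a previous value
	 * of 1 means the counter just reached zero, which is the condition
	 * atomic_dec_and_test() reports in the kernel code above. */
	if (atomic_fetch_sub(&inode->i_dio_count, 1) == 1)
		inode->wakeups++;
}
```

With two requests started, only the second `model_dio_end()` call registers a wakeup, which is the moment a waiter for quiesced direct I/O could proceed.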
|
2012-05-31 23:22:33 +07:00
|
|
|
|
2019-12-01 08:49:44 +07:00
|
|
|
/*
|
|
|
|
* Warn about a page cache invalidation failure during a direct I/O write.
|
|
|
|
*/
|
|
|
|
void dio_warn_stale_pagecache(struct file *filp);
|
|
|
|
|
2014-03-25 01:43:12 +07:00
|
|
|
extern void inode_set_flags(struct inode *inode, unsigned int flags,
|
|
|
|
unsigned int mask);
|
|
|
|
|
2006-03-28 16:56:42 +07:00
|
|
|
extern const struct file_operations generic_ro_fops;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
|
|
|
|
|
2014-03-15 00:42:45 +07:00
|
|
|
extern int readlink_copy(char __user *, int, const char *);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int page_readlink(struct dentry *, char __user *, int);
|
2015-12-30 03:58:39 +07:00
|
|
|
extern const char *page_get_link(struct dentry *, struct inode *,
|
|
|
|
struct delayed_call *);
|
|
|
|
extern void page_put_link(void *);
|
2006-03-11 18:27:13 +07:00
|
|
|
extern int __page_symlink(struct inode *inode, const char *symname, int len,
|
fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in its write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 03:00:53 +07:00
|
|
|
int nofs);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int page_symlink(struct inode *inode, const char *symname, int len);
|
2007-02-12 15:55:40 +07:00
|
|
|
extern const struct inode_operations page_symlink_inode_operations;
|
2015-12-30 03:58:39 +07:00
|
|
|
extern void kfree_link(void *);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void generic_fillattr(struct inode *, struct kstat *);
|
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct statx [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
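As a rough illustration of the lstat() emulation described above, here is a minimal userspace sketch. It assumes a libc that exposes a statx() wrapper and struct statx through <sys/stat.h> (glibc 2.28 or later does); the helper name lstat_via_statx() is hypothetical and not part of this patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>  /* statx(), struct statx, STATX_BASIC_STATS */

/*
 * Hypothetical helper: behave like lstat() by telling statx() not to
 * follow a trailing symlink.  Returns 0 on success, or -1 with errno
 * set on failure, exactly as the libc wrapper does.
 */
static int lstat_via_statx(const char *path, struct statx *stx)
{
	return statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
		     STATX_BASIC_STATS, stx);
}
```

A caller would pass a path and then inspect stx_mode, for instance with S_ISLNK(), to see whether the object itself is a symlink.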
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
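The three synchronisation modes are mutually exclusive, so a caller might want to validate the mode before OR'ing it into flags. The sketch below assumes the AT_STATX_* constants are visible via <fcntl.h> with _GNU_SOURCE defined (as in glibc 2.28 or later); statx_sync() is a hypothetical name used only for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>     /* AT_FDCWD, AT_STATX_* */
#include <sys/stat.h>  /* statx(), struct statx */

/*
 * Hypothetical helper: query the basic stats of 'path' using exactly one
 * of the three synchronisation modes.  Any other value is rejected with
 * EINVAL before the system call is made.
 */
static int statx_sync(const char *path, int sync_mode, struct statx *stx)
{
	if (sync_mode != AT_STATX_SYNC_AS_STAT &&
	    sync_mode != AT_STATX_FORCE_SYNC &&
	    sync_mode != AT_STATX_DONT_SYNC) {
		errno = EINVAL;
		return -1;
	}
	return statx(AT_FDCWD, path, sync_mode, STATX_BASIC_STATS, stx);
}
```

On a local filesystem all three modes would normally return the same values; the difference is only expected to show on a network filesystem.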
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
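The 256-byte requirement can be checked at compile time. This is only a sketch, assuming struct statx and struct statx_timestamp come from <sys/stat.h> (glibc 2.28 or later); statx_buffer_size() is a hypothetical helper added so the value can also be inspected at run time.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/stat.h>  /* struct statx, struct statx_timestamp */

/* Fail the build if the userspace view of the structures ever drifts. */
_Static_assert(sizeof(struct statx_timestamp) == 16,
	       "statx_timestamp is expected to be 16 bytes");
_Static_assert(sizeof(struct statx) == 256,
	       "struct statx is expected to be 256 bytes");

/* Hypothetical helper: report the buffer size statx() expects. */
static size_t statx_buffer_size(void)
{
	return sizeof(struct statx);
}
```

The figure follows from the layout given below: 64 bytes of scalar fields up to __spare1, four 16-byte timestamps, 16 bytes of device numbers and 14 spare __u64 words.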
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
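Because a filesystem may not be able to supply everything that was asked for, a caller should test stx_mask before trusting an optional field. A minimal sketch, assuming the STATX_* constants from <sys/stat.h> (glibc 2.28 or later); the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/stat.h>  /* struct statx, STATX_BTIME */

/* Hypothetical helper: non-zero if the creation time was actually provided. */
static int statx_has_btime(const struct statx *stx)
{
	return (stx->stx_mask & STATX_BTIME) != 0;
}

/*
 * Hypothetical helper: store the creation time in seconds in *sec and
 * return 1 if it is valid; otherwise leave *sec untouched and return 0.
 */
static int statx_btime_sec(const struct statx *stx, long long *sec)
{
	if (!statx_has_btime(stx))
		return 0;
	*sec = stx->stx_btime.tv_sec;
	return 1;
}
```

The same pattern applies to any other STATX_* bit: request it in mask, then check for it in stx_mask after the call returns.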
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
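A tool that wants to mark files specially could turn stx_attributes into a readable list along the following lines. This is a sketch only: it assumes the STATX_ATTR_* constants are available from <sys/stat.h> (glibc 2.28 or later), and describe_attrs() and its label strings are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>  /* STATX_ATTR_* */

/*
 * Hypothetical helper: write a comma-separated list of the attribute
 * flags set in 'attrs' into buf (at most len bytes including the NUL).
 * Names that do not fit are silently truncated.
 */
static void describe_attrs(unsigned long long attrs, char *buf, size_t len)
{
	static const struct {
		unsigned long long bit;
		const char *name;
	} table[] = {
		{ STATX_ATTR_COMPRESSED, "compressed" },
		{ STATX_ATTR_IMMUTABLE,  "immutable"  },
		{ STATX_ATTR_APPEND,     "append"     },
		{ STATX_ATTR_NODUMP,     "nodump"     },
		{ STATX_ATTR_ENCRYPTED,  "encrypted"  },
		{ STATX_ATTR_AUTOMOUNT,  "automount"  },
	};
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (!(attrs & table[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, table[i].name, len - strlen(buf) - 1);
	}
}
```

An attribute word with only the automount bit set would be rendered as "automount" by this helper.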
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
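Class (2) can be handled in userspace roughly as follows: only combine stx_rdev_major and stx_rdev_minor into a dev_t when the type is known and is a block or character device. This is a sketch assuming glibc's <sys/sysmacros.h> for makedev(); statx_rdev() is a hypothetical helper.

```c
#define _GNU_SOURCE
#include <sys/types.h>      /* dev_t */
#include <sys/stat.h>       /* struct statx, STATX_TYPE, S_ISBLK, S_ISCHR */
#include <sys/sysmacros.h>  /* makedev(), major(), minor() */

/*
 * Hypothetical helper: if the object is a block or character device,
 * store its device number in *rdev and return 1.  Return 0 when the
 * type is not valid or the object is not a device node.
 */
static int statx_rdev(const struct statx *stx, dev_t *rdev)
{
	if (!(stx->stx_mask & STATX_TYPE))
		return 0;
	if (!S_ISBLK(stx->stx_mode) && !S_ISCHR(stx->stx_mode))
		return 0;
	*rdev = makedev(stx->stx_rdev_major, stx->stx_rdev_minor);
	return 1;
}
```

The STATX_TYPE check reflects the note under class (1) that the type itself might not be valid for some objects.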
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 23:46:22 +07:00
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
2009-12-14 19:21:12 +07:00
void __inode_add_bytes(struct inode *inode, loff_t bytes);
2005-04-17 05:20:36 +07:00
void inode_add_bytes(struct inode *inode, loff_t bytes);
2013-08-17 20:32:32 +07:00
void __inode_sub_bytes(struct inode *inode, loff_t bytes);
2005-04-17 05:20:36 +07:00
void inode_sub_bytes(struct inode *inode, loff_t bytes);
2017-08-08 14:54:36 +07:00
static inline loff_t __inode_get_bytes(struct inode *inode)
{
	return (((loff_t)inode->i_blocks) << 9) + inode->i_bytes;
}
2005-04-17 05:20:36 +07:00
loff_t inode_get_bytes(struct inode *inode);
void inode_set_bytes(struct inode *inode, loff_t bytes);
2015-12-30 03:58:39 +07:00
const char *simple_get_link(struct dentry *, struct inode *,
			    struct delayed_call *);
2015-05-02 20:54:06 +07:00
extern const struct inode_operations simple_symlink_inode_operations;
2005-04-17 05:20:36 +07:00

2013-05-16 00:52:59 +07:00
extern int iterate_dir(struct file *, struct dir_context *);
2005-04-17 05:20:36 +07:00
2017-01-31 23:46:22 +07:00
extern int vfs_statx(int, const char __user *, int, struct kstat *, u32);
extern int vfs_statx_fd(unsigned int, struct kstat *, u32, unsigned int);

static inline int vfs_stat(const char __user *filename, struct kstat *stat)
{
2017-05-05 05:30:16 +07:00
	return vfs_statx(AT_FDCWD, filename, AT_NO_AUTOMOUNT,
			 stat, STATX_BASIC_STATS);
2017-01-31 23:46:22 +07:00
}
static inline int vfs_lstat(const char __user *name, struct kstat *stat)
{
2017-05-05 05:30:16 +07:00
	return vfs_statx(AT_FDCWD, name, AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT,
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
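As a rough userspace illustration of the two emulations just described, the sketch below wraps statx() in lstat()-like and fstat()-like helpers. It assumes a glibc new enough to declare statx() (2.28 or later); the helper names are hypothetical. Note that the fstat()-like helper uses an empty string together with AT_EMPTY_PATH, which is how the released interface documents this case, rather than the NULL filename mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH, open() */
#include <sys/stat.h>   /* statx(), struct statx, STATX_BASIC_STATS */

/* Hypothetical helper: lstat()-like query, a trailing symlink is not followed. */
int sketch_statx_lstat(const char *path, struct statx *stx)
{
	assert(path && stx);
	return statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
		     STATX_BASIC_STATS, stx);
}

/* Hypothetical helper: fstat()-like query on an already open descriptor. */
int sketch_statx_fstat(int fd, struct statx *stx)
{
	assert(stx);
	/* Empty path + AT_EMPTY_PATH means "the object fd itself refers to". */
	return statx(fd, "", AT_EMPTY_PATH, STATX_BASIC_STATS, stx);
}
```

Both helpers return 0 on success and -1 with errno set on failure, as statx() itself does.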
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
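The interplay between the request mask and the returned stx_mask can be sketched as follows. This is only an illustration under the assumption of a glibc that declares statx() (2.28 or later); sketch_query_btime() is a hypothetical name. It asks for the basic stats plus the creation time with stat()-like synchronisation, then checks whether the filesystem actually supplied stx_btime.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_STATX_SYNC_AS_STAT */
#include <sys/stat.h>   /* statx(), struct statx, STATX_* */

/*
 * Hypothetical helper.
 * Returns -1 on error (errno set by statx()),
 *          1 if the call succeeded and stx_btime was supplied,
 *          0 if the call succeeded but no creation time is available.
 */
int sketch_query_btime(const char *path, struct statx *stx)
{
	assert(path && stx);
	if (statx(AT_FDCWD, path, AT_STATX_SYNC_AS_STAT,
		  STATX_BASIC_STATS | STATX_BTIME, stx) < 0)
		return -1;
	/* Only trust a field if its bit came back in stx_mask. */
	return (stx->stx_mask & STATX_BTIME) ? 1 : 0;
}
```

Swapping AT_STATX_SYNC_AS_STAT for AT_STATX_FORCE_SYNC or AT_STATX_DONT_SYNC changes only how hard a network filesystem works to refresh the values; the stx_mask check stays the same.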
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
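The 256-byte size requirement can be checked against the layout listed above. The following sketch mirrors the two structures with fixed-width types from <stdint.h> (standing in for __u32, __s64 and friends) and verifies the arithmetic at compile time. The sketch_ names are hypothetical and this is a local copy for illustration, not the kernel header.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct sketch_timestamp {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

struct sketch_statx {
	uint32_t stx_mask;
	uint32_t stx_blksize;
	uint64_t stx_attributes;
	uint32_t stx_nlink;
	uint32_t stx_uid;
	uint32_t stx_gid;
	uint16_t stx_mode;
	uint16_t spare0[1];
	uint64_t stx_ino;
	uint64_t stx_size;
	uint64_t stx_blocks;
	uint64_t spare1[1];
	struct sketch_timestamp stx_atime;
	struct sketch_timestamp stx_btime;
	struct sketch_timestamp stx_ctime;
	struct sketch_timestamp stx_mtime;
	uint32_t stx_rdev_major;
	uint32_t stx_rdev_minor;
	uint32_t stx_dev_major;
	uint32_t stx_dev_minor;
	uint64_t spare2[14];
};

/* 32 bytes of small fields, 32 of u64s, 64 of timestamps, 16 of device
 * numbers and 112 of spare space add up to 256. */
static_assert(sizeof(struct sketch_timestamp) == 16, "timestamp is 16 bytes");
static_assert(offsetof(struct sketch_statx, stx_ino) == 32, "u64 block at 32");
static_assert(offsetof(struct sketch_statx, stx_atime) == 64, "times at 64");
static_assert(offsetof(struct sketch_statx, stx_rdev_major) == 128, "devs at 128");
static_assert(sizeof(struct sketch_statx) == 256, "record is 256 bytes");
```

Because every member sits at a naturally aligned offset, no implicit padding is needed, which is what lets the fields be a consistent size on all arches.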
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
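The sign convention described above (seconds and nanoseconds both negative before 1970) differs from the usual struct timespec convention, where the nanosecond part is always non-negative. A minimal sketch of converting between the two, assuming the convention exactly as stated in this text, is given below. The type and function names are hypothetical and the released header may define the fields differently.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the seconds/nanoseconds pair described above. */
struct sketch_time {
	int64_t tv_sec;
	int32_t tv_nsec;
};

/* Total nanoseconds relative to the epoch; negative before 1970.
 * May overflow for |tv_sec| beyond roughly 292 years. */
int64_t sketch_total_ns(struct sketch_time t)
{
	return t.tv_sec * INT64_C(1000000000) + t.tv_nsec;
}

/* Rewrite so that tv_nsec lies in [0, 999999999], timespec-style. */
struct sketch_time sketch_normalise(struct sketch_time t)
{
	assert(t.tv_nsec > -1000000000 && t.tv_nsec < 1000000000);
	if (t.tv_nsec < 0) {
		t.tv_sec -= 1;
		t.tv_nsec += 1000000000;
	}
	return t;
}
```

For instance, one and a half seconds before the epoch is stored as (-1, -500000000) under the convention above and becomes (-2, 500000000) after normalisation; both describe the same instant.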
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
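A tool that wants to act on these bits might decode stx_attributes along the following lines. This is a sketch, assuming the STATX_ATTR_* constants are available from <sys/stat.h> (glibc 2.28 or later with _GNU_SOURCE); the function name and the short labels are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>   /* STATX_ATTR_* */

static const struct {
	uint64_t bit;
	const char *name;   /* illustrative label, not an official name */
} sketch_attr_names[] = {
	{ STATX_ATTR_COMPRESSED, "compressed" },
	{ STATX_ATTR_IMMUTABLE,  "immutable"  },
	{ STATX_ATTR_APPEND,     "append"     },
	{ STATX_ATTR_NODUMP,     "nodump"     },
	{ STATX_ATTR_ENCRYPTED,  "encrypted"  },
	{ STATX_ATTR_AUTOMOUNT,  "automount"  },
};

/*
 * Write a comma-separated list of the set attributes into buf.
 * Returns the string length, or -1 if buf is too small.
 */
int sketch_describe_attrs(uint64_t attrs, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf && len > 0);
	buf[0] = '\0';
	for (size_t i = 0;
	     i < sizeof(sketch_attr_names) / sizeof(sketch_attr_names[0]); i++) {
		const char *name = sketch_attr_names[i].name;
		size_t need = strlen(name) + (used ? 1 : 0);

		if (!(attrs & sketch_attr_names[i].bit))
			continue;
		if (used + need >= len)
			return -1;          /* no room for text plus NUL */
		if (used)
			buf[used++] = ',';
		memcpy(buf + used, name, strlen(name) + 1);
		used += strlen(name);
	}
	return (int)used;
}
```

Bits that the helper does not know about are silently ignored, so newer kernels adding attributes would not break it.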
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
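In practice the classes above translate into a few checks a caller would make before trusting a field. The sketch below shows two such checks on an already filled-in struct statx; it assumes glibc's declaration of the structure (2.28 or later) and uses hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/stat.h>        /* struct statx, STATX_*, S_ISCHR, S_ISBLK */
#include <sys/sysmacros.h>   /* makedev(), major(), minor() */
#include <sys/types.h>       /* dev_t */

/* Class (1)/(3): a value is only meaningful if its stx_mask bit is set. */
int sketch_has_field(const struct statx *stx, unsigned int bit)
{
	assert(stx);
	return (stx->stx_mask & bit) != 0;
}

/*
 * Class (2): stx_rdev_* only describes something for device nodes.
 * Returns 1 and stores the device number in *out for a blockdev or
 * chardev whose type is known to be valid, otherwise returns 0.
 */
int sketch_get_rdev(const struct statx *stx, dev_t *out)
{
	assert(stx && out);
	if (!sketch_has_field(stx, STATX_TYPE))
		return 0;
	if (!S_ISCHR(stx->stx_mode) && !S_ISBLK(stx->stx_mode))
		return 0;
	*out = makedev(stx->stx_rdev_major, stx->stx_rdev_minor);
	return 1;
}
```

A caller would typically follow a successful statx() with sketch_has_field(&stx, STATX_UID) or similar before displaying the corresponding value.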
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 23:46:22 +07:00
			 stat, STATX_BASIC_STATS);
}
static inline int vfs_fstatat(int dfd, const char __user *filename,
			      struct kstat *stat, int flags)
{
2017-11-30 07:11:26 +07:00
	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
			 stat, STATX_BASIC_STATS);
2017-01-31 23:46:22 +07:00
}
static inline int vfs_fstat(int fd, struct kstat *stat)
{
	return vfs_statx_fd(fd, stat, STATX_BASIC_STATS, 0);
}


2016-10-04 19:40:45 +07:00
extern const char *vfs_get_link(struct dentry *, struct delayed_call *);
2016-12-09 22:45:04 +07:00
extern int vfs_readlink(struct dentry *, char __user *, int);
2005-04-17 05:20:36 +07:00

2017-07-04 23:25:16 +07:00
extern struct file_system_type *get_filesystem(struct file_system_type *fs);
2007-10-19 13:39:11 +07:00
extern void put_filesystem(struct file_system_type *fs);
2005-04-17 05:20:36 +07:00
extern struct file_system_type *get_fs_type(const char *name);
extern struct super_block *get_super(struct block_device *);
2012-02-10 17:03:00 +07:00
extern struct super_block *get_super_thawed(struct block_device *);
2016-11-23 18:53:00 +07:00
extern struct super_block *get_super_exclusive_thawed(struct block_device *bdev);
2009-08-04 04:28:35 +07:00
extern struct super_block *get_active_super(struct block_device *bdev);
2005-04-17 05:20:36 +07:00
extern void drop_super(struct super_block *sb);
2016-11-23 18:53:00 +07:00
extern void drop_super_exclusive(struct super_block *sb);
2010-03-23 17:06:58 +07:00
extern void iterate_supers(void (*)(struct super_block *, void *), void *);
2011-06-04 07:16:57 +07:00
extern void iterate_supers_type(struct file_system_type *,
				void (*)(struct super_block *, void *), void *);
2005-04-17 05:20:36 +07:00

extern int dcache_dir_open(struct inode *, struct file *);
extern int dcache_dir_close(struct inode *, struct file *);
extern loff_t dcache_dir_lseek(struct file *, loff_t, int);
2013-05-16 07:23:06 +07:00
extern int dcache_readdir(struct file *, struct dir_context *);
fs: introduce new truncate sequence
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
the event of write failure after allocation, so this must be performed
in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-26 22:05:33 +07:00
extern int simple_setattr(struct dentry *, struct iattr *);
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation time and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
functions.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
    __s64 tv_sec;
    __s32 tv_nsec;
    __s32 __reserved;
};
struct statx {
    __u32 stx_mask;
    __u32 stx_blksize;
    __u64 stx_attributes;
    __u32 stx_nlink;
    __u32 stx_uid;
    __u32 stx_gid;
    __u16 stx_mode;
    __u16 __spare0[1];
    __u64 stx_ino;
    __u64 stx_size;
    __u64 stx_blocks;
    __u64 __spare1[1];
    struct statx_timestamp stx_atime;
    struct statx_timestamp stx_btime;
    struct statx_timestamp stx_ctime;
    struct statx_timestamp stx_mtime;
    __u32 stx_rdev_major;
    __u32 stx_rdev_minor;
    __u32 stx_dev_major;
    __u32 stx_dev_minor;
    __u64 __spare2[14];
};
The defined bits in mask and stx_mask are:
    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime
    STATX_MTIME         Want/got stx_mtime
    STATX_CTIME         Want/got stx_ctime
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime
    STATX_ALL           [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
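The sign convention described above means the two fields can simply be added once the seconds are scaled. The following is a minimal sketch under that assumption; struct example_timestamp is an illustrative mirror of the layout shown earlier rather than the uapi definition, and example_timestamp_to_ns is a hypothetical helper.

```c
#include <stdint.h>

/* Illustrative mirror of the timestamp layout (not the uapi header). */
struct example_timestamp {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t reserved;
};

/*
 * Collapse a timestamp into signed nanoseconds relative to the 1970 epoch.
 * Because tv_nsec carries the same sign as tv_sec for pre-1970 times,
 * a plain addition gives the right result in both directions.
 */
int64_t example_timestamp_to_ns(struct example_timestamp ts)
{
    return ts.tv_sec * INT64_C(1000000000) + ts.tv_nsec;
}
```

For instance, one and a half seconds before the epoch would be stored as tv_sec = -1, tv_nsec = -500000000 under this convention.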
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
    STATX_ATTR_COMPRESSED   File is compressed by the fs
    STATX_ATTR_IMMUTABLE    File is marked immutable
    STATX_ATTR_APPEND       File is append-only
    STATX_ATTR_NODUMP       File is not to be dumped
    STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
    These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.
    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.
    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.
    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.
    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.
(2) stx_rdev_*.
    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev; otherwise it will be 0.
(3) stx_btime.
    Similar to (1), except this will be set to 0 if it doesn't exist.
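Since a field may be returned without being valid, a caller is expected to test stx_mask before trusting a value. The sketch below shows one way to do that check. The EX_-prefixed constants repeat the bit values from the statx uapi header so the fragment stands alone, and example_field_valid is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values repeated from the statx uapi header, with an EX_ prefix. */
#define EX_STATX_UID          0x00000008U
#define EX_STATX_BASIC_STATS  0x000007ffU
#define EX_STATX_BTIME        0x00000800U

/* True only if every bit in `wanted` was reported as valid in stx_mask. */
bool example_field_valid(uint32_t stx_mask, uint32_t wanted)
{
    return (stx_mask & wanted) == wanted;
}
```

With a returned mask of 0x7ff, the whole basic set is valid but the creation time is not, so stx_btime should be ignored.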
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 23:46:22 +07:00
|
|
|
extern int simple_getattr(const struct path *, struct kstat *, u32, unsigned int);
|
2006-06-23 16:02:58 +07:00
|
|
|
extern int simple_statfs(struct dentry *, struct kstatfs *);
|
2012-04-06 04:25:09 +07:00
|
|
|
extern int simple_open(struct inode *inode, struct file *file);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int simple_link(struct dentry *, struct inode *, struct dentry *);
|
|
|
|
extern int simple_unlink(struct inode *, struct dentry *);
|
|
|
|
extern int simple_rmdir(struct inode *, struct dentry *);
|
2016-09-27 16:03:57 +07:00
|
|
|
extern int simple_rename(struct inode *, struct dentry *,
|
|
|
|
struct inode *, struct dentry *, unsigned int);
|
2019-11-18 21:43:10 +07:00
|
|
|
extern void simple_recursive_removal(struct dentry *,
|
|
|
|
void (*callback)(struct dentry *));
|
2011-07-17 07:44:56 +07:00
|
|
|
extern int noop_fsync(struct file *, loff_t, loff_t, int);
|
2018-03-08 06:26:44 +07:00
|
|
|
extern int noop_set_page_dirty(struct page *page);
|
|
|
|
extern void noop_invalidatepage(struct page *page, unsigned int offset,
|
|
|
|
unsigned int length);
|
|
|
|
extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int simple_empty(struct dentry *);
|
|
|
|
extern int simple_readpage(struct file *file, struct page *page);
|
2007-10-16 15:25:01 +07:00
|
|
|
extern int simple_write_begin(struct file *file, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned flags,
|
|
|
|
struct page **pagep, void **fsdata);
|
|
|
|
extern int simple_write_end(struct file *file, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned copied,
|
|
|
|
struct page *page, void *fsdata);
|
2013-10-26 05:47:37 +07:00
|
|
|
extern int always_delete_dentry(const struct dentry *);
|
2013-10-03 09:35:11 +07:00
|
|
|
extern struct inode *alloc_anon_inode(struct super_block *);
|
2014-08-22 21:40:25 +07:00
|
|
|
extern int simple_nosetlease(struct file *, long, struct file_lock **, void **);
|
2013-10-26 05:47:37 +07:00
|
|
|
extern const struct dentry_operations simple_dentry_operations;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-11 04:13:09 +07:00
|
|
|
extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
|
2006-03-28 16:56:42 +07:00
|
|
|
extern const struct file_operations simple_dir_operations;
|
2007-02-12 15:55:40 +07:00
|
|
|
extern const struct inode_operations simple_dir_inode_operations;
|
2015-05-10 03:54:49 +07:00
|
|
|
extern void make_empty_dir_inode(struct inode *inode);
|
|
|
|
extern bool is_empty_dir_inode(struct inode *inode);
|
2017-03-26 11:15:37 +07:00
|
|
|
struct tree_descr { const char *name; const struct file_operations *ops; int mode; };
|
2005-04-17 05:20:36 +07:00
|
|
|
struct dentry *d_alloc_name(struct dentry *, const char *);
|
2017-03-26 11:15:37 +07:00
|
|
|
extern int simple_fill_super(struct super_block *, unsigned long,
|
|
|
|
const struct tree_descr *);
|
2006-06-09 20:34:16 +07:00
|
|
|
extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void simple_release_fs(struct vfsmount **mount, int *count);
|
|
|
|
|
2008-06-06 12:46:21 +07:00
|
|
|
extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
|
|
|
|
loff_t *ppos, const void *from, size_t available);
|
2010-05-02 04:51:22 +07:00
|
|
|
extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
|
|
|
|
const void __user *from, size_t count);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-06-05 06:06:27 +07:00
|
|
|
extern int __generic_file_fsync(struct file *, loff_t, loff_t, int);
|
2011-07-17 07:44:56 +07:00
|
|
|
extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
|
2009-06-08 01:56:44 +07:00
|
|
|
|
2010-07-23 05:03:41 +07:00
|
|
|
extern int generic_check_addressable(unsigned, u64);
|
|
|
|
|
2006-02-01 18:05:41 +07:00
|
|
|
#ifdef CONFIG_MIGRATION
|
2006-06-23 16:03:33 +07:00
|
|
|
extern int buffer_migrate_page(struct address_space *,
|
2012-01-13 08:19:43 +07:00
|
|
|
struct page *, struct page *,
|
|
|
|
enum migrate_mode);
|
2018-12-28 15:39:12 +07:00
|
|
|
extern int buffer_migrate_page_norefs(struct address_space *,
|
|
|
|
struct page *, struct page *,
|
|
|
|
enum migrate_mode);
|
2006-02-01 18:05:41 +07:00
|
|
|
#else
|
|
|
|
#define buffer_migrate_page NULL
|
2018-12-28 15:39:12 +07:00
|
|
|
#define buffer_migrate_page_norefs NULL
|
2006-02-01 18:05:41 +07:00
|
|
|
#endif
|
|
|
|
|
2016-05-26 21:55:18 +07:00
|
|
|
extern int setattr_prepare(struct dentry *, struct iattr *);
|
2009-08-20 23:35:05 +07:00
|
|
|
extern int inode_newsize_ok(const struct inode *, loff_t offset);
|
2010-06-04 16:30:00 +07:00
|
|
|
extern void setattr_copy(struct inode *inode, const struct iattr *attr);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-03-26 20:59:21 +07:00
|
|
|
extern int file_update_time(struct file *file);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2020-03-25 00:46:48 +07:00
|
|
|
static inline bool vma_is_dax(const struct vm_area_struct *vma)
|
2017-07-11 05:48:25 +07:00
|
|
|
{
|
|
|
|
return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
|
|
|
|
}
|
|
|
|
|
2017-11-30 07:10:35 +07:00
|
|
|
static inline bool vma_is_fsdax(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
if (!vma->vm_file)
|
|
|
|
return false;
|
|
|
|
if (!vma_is_dax(vma))
|
|
|
|
return false;
|
|
|
|
inode = file_inode(vma->vm_file);
|
2018-02-22 08:08:01 +07:00
|
|
|
if (S_ISCHR(inode->i_mode))
|
2017-11-30 07:10:35 +07:00
|
|
|
return false; /* device-dax */
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2015-04-10 00:52:01 +07:00
|
|
|
static inline int iocb_flags(struct file *file)
|
|
|
|
{
|
|
|
|
int res = 0;
|
|
|
|
if (file->f_flags & O_APPEND)
|
|
|
|
res |= IOCB_APPEND;
|
2020-04-30 21:41:33 +07:00
|
|
|
if (file->f_flags & O_DIRECT)
|
2015-04-10 00:52:01 +07:00
|
|
|
res |= IOCB_DIRECT;
|
2016-04-07 22:52:00 +07:00
|
|
|
if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
|
|
|
|
res |= IOCB_DSYNC;
|
|
|
|
if (file->f_flags & __O_SYNC)
|
|
|
|
res |= IOCB_SYNC;
|
2015-04-10 00:52:01 +07:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2017-07-06 23:58:37 +07:00
|
|
|
static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
|
2017-06-20 19:05:40 +07:00
|
|
|
{
|
2020-08-01 17:36:33 +07:00
|
|
|
int kiocb_flags = 0;
|
|
|
|
|
|
|
|
if (!flags)
|
|
|
|
return 0;
|
2017-06-20 19:05:40 +07:00
|
|
|
if (unlikely(flags & ~RWF_SUPPORTED))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2017-06-20 19:05:43 +07:00
|
|
|
if (flags & RWF_NOWAIT) {
|
2017-08-29 21:13:20 +07:00
|
|
|
if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
|
2017-06-20 19:05:43 +07:00
|
|
|
return -EOPNOTSUPP;
|
2020-08-01 17:36:33 +07:00
|
|
|
kiocb_flags |= IOCB_NOWAIT;
|
2017-06-20 19:05:43 +07:00
|
|
|
}
|
2017-06-20 19:05:40 +07:00
|
|
|
if (flags & RWF_HIPRI)
|
2020-08-01 17:36:33 +07:00
|
|
|
kiocb_flags |= IOCB_HIPRI;
|
2017-06-20 19:05:40 +07:00
|
|
|
if (flags & RWF_DSYNC)
|
2020-08-01 17:36:33 +07:00
|
|
|
kiocb_flags |= IOCB_DSYNC;
|
2017-06-20 19:05:40 +07:00
|
|
|
if (flags & RWF_SYNC)
|
2020-08-01 17:36:33 +07:00
|
|
|
kiocb_flags |= (IOCB_DSYNC | IOCB_SYNC);
|
2017-09-29 19:07:17 +07:00
|
|
|
if (flags & RWF_APPEND)
|
2020-08-01 17:36:33 +07:00
|
|
|
kiocb_flags |= IOCB_APPEND;
|
|
|
|
|
|
|
|
ki->ki_flags |= kiocb_flags;
|
2017-06-20 19:05:40 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
static inline ino_t parent_ino(struct dentry *dentry)
|
|
|
|
{
|
|
|
|
ino_t res;
|
|
|
|
|
2011-01-07 13:49:38 +07:00
|
|
|
/*
|
|
|
|
* Don't strictly need d_lock here? If the parent ino could change
|
|
|
|
* then surely we'd have a deeper race in the caller?
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
spin_lock(&dentry->d_lock);
|
|
|
|
res = dentry->d_parent->d_inode->i_ino;
|
|
|
|
spin_unlock(&dentry->d_lock);
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Transaction based IO helpers */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* An argresp is stored in an allocated page and holds the
|
|
|
|
* size of the argument or response, along with its content
|
|
|
|
*/
|
|
|
|
struct simple_transaction_argresp {
|
|
|
|
ssize_t size;
|
|
|
|
char data[0];
|
|
|
|
};
|
|
|
|
|
|
|
|
#define SIMPLE_TRANSACTION_LIMIT (PAGE_SIZE - sizeof(struct simple_transaction_argresp))
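The limit above is simply what remains of one page after the size header. A userspace sketch of the same arithmetic follows; the 4 KiB page size is an assumption, and every EX_/ex_ name is hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

#define EX_PAGE_SIZE 4096U   /* assumption: 4 KiB pages */

struct ex_argresp {
    ssize_t size;
    char data[];   /* C99 flexible array member, standing in for data[0] */
};

#define EX_TRANSACTION_LIMIT (EX_PAGE_SIZE - sizeof(struct ex_argresp))

/* Number of payload bytes that fit in one page next to the header. */
size_t ex_transaction_limit(void)
{
    return EX_TRANSACTION_LIMIT;
}
```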
|
|
|
|
|
|
|
|
char *simple_transaction_get(struct file *file, const char __user *buf,
|
|
|
|
size_t size);
|
|
|
|
ssize_t simple_transaction_read(struct file *file, char __user *buf,
|
|
|
|
size_t size, loff_t *pos);
|
|
|
|
int simple_transaction_release(struct inode *inode, struct file *file);
|
|
|
|
|
2009-03-25 22:48:35 +07:00
|
|
|
void simple_transaction_set(struct file *file, size_t n);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2005-05-18 19:40:59 +07:00
|
|
|
/*
|
|
|
|
* simple attribute files
|
|
|
|
*
|
|
|
|
* These attributes behave similarly to those in sysfs:
|
|
|
|
*
|
|
|
|
* Writing to an attribute immediately sets a value, an open file can be
|
|
|
|
* written to multiple times.
|
|
|
|
*
|
|
|
|
* Reading from an attribute creates a buffer from the value that might get
|
|
|
|
* read with multiple read calls. When the attribute has been read
|
|
|
|
* completely, no further read calls are possible until the file is opened
|
|
|
|
* again.
|
|
|
|
*
|
|
|
|
* All attributes contain a text representation of a numeric value
|
|
|
|
* that is accessed with the get() and set() functions.
|
|
|
|
*/
|
|
|
|
#define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \
|
|
|
|
static int __fops ## _open(struct inode *inode, struct file *file) \
|
|
|
|
{ \
|
|
|
|
__simple_attr_check_format(__fmt, 0ull); \
|
|
|
|
return simple_attr_open(inode, file, __get, __set, __fmt); \
|
|
|
|
} \
|
2009-10-02 05:43:56 +07:00
|
|
|
static const struct file_operations __fops = { \
|
2005-05-18 19:40:59 +07:00
|
|
|
.owner = THIS_MODULE, \
|
|
|
|
.open = __fops ## _open, \
|
2008-02-08 19:20:28 +07:00
|
|
|
.release = simple_attr_release, \
|
2005-05-18 19:40:59 +07:00
|
|
|
.read = simple_attr_read, \
|
|
|
|
.write = simple_attr_write, \
|
2010-08-16 02:50:52 +07:00
|
|
|
.llseek = generic_file_llseek, \
|
2014-08-07 06:08:45 +07:00
|
|
|
}
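The behaviour described in the comment above (a numeric value exposed as text through get() and set() callbacks and a printf format) can be modelled in userspace. The sketch below is a toy model only: it does not reproduce simple_attr_read() or simple_attr_write(), which are declared in this header but defined elsewhere, and all toy_ names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_attr {
    int (*get)(void *data, uint64_t *val);
    int (*set)(void *data, uint64_t val);
    const char *fmt;   /* printf format taking one unsigned long long */
    void *data;
};

/* Format the current value into buf; returns the byte count or -1. */
int toy_attr_read(const struct toy_attr *attr, char *buf, size_t len)
{
    uint64_t val;
    int n;

    if (!attr->get || attr->get(attr->data, &val) != 0)
        return -1;
    n = snprintf(buf, len, attr->fmt, (unsigned long long)val);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Parse text as a number and pass it to set(); 0 on success, -1 on error. */
int toy_attr_write(const struct toy_attr *attr, const char *text)
{
    char *end;
    unsigned long long val;

    if (!attr->set)
        return -1;
    val = strtoull(text, &end, 0);
    if (end == text)
        return -1;   /* no digits were consumed */
    return attr->set(attr->data, (uint64_t)val);
}

/* Example backing store: a plain counter. */
int toy_counter_get(void *data, uint64_t *val)
{
    *val = *(uint64_t *)data;
    return 0;
}

int toy_counter_set(void *data, uint64_t val)
{
    *(uint64_t *)data = val;
    return 0;
}
```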
|
2005-05-18 19:40:59 +07:00
|
|
|
|
2011-11-01 07:11:33 +07:00
|
|
|
static inline __printf(1, 2)
|
|
|
|
void __simple_attr_check_format(const char *fmt, ...)
|
2005-05-18 19:40:59 +07:00
|
|
|
{
|
|
|
|
/* don't do anything, just let the compiler check the arguments; */
|
|
|
|
}
|
|
|
|
|
|
|
|
int simple_attr_open(struct inode *inode, struct file *file,
|
2008-02-08 19:20:26 +07:00
|
|
|
int (*get)(void *, u64 *), int (*set)(void *, u64),
|
2005-05-18 19:40:59 +07:00
|
|
|
const char *fmt);
|
2008-02-08 19:20:28 +07:00
|
|
|
int simple_attr_release(struct inode *inode, struct file *file);
|
2005-05-18 19:40:59 +07:00
|
|
|
ssize_t simple_attr_read(struct file *file, char __user *buf,
|
|
|
|
size_t len, loff_t *ppos);
|
|
|
|
ssize_t simple_attr_write(struct file *file, const char __user *buf,
|
|
|
|
size_t len, loff_t *ppos);
|
|
|
|
|
2007-10-17 13:26:21 +07:00
|
|
|
struct ctl_table;
|
2009-09-24 05:57:19 +07:00
|
|
|
int proc_nr_files(struct ctl_table *table, int write,
|
2020-04-24 13:43:38 +07:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos);
|
2010-10-10 16:36:23 +07:00
|
|
|
int proc_nr_dentry(struct ctl_table *table, int write,
|
2020-04-24 13:43:38 +07:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos);
|
2010-10-23 16:03:02 +07:00
|
|
|
int proc_nr_inodes(struct ctl_table *table, int write,
|
2020-04-24 13:43:38 +07:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos);
|
2009-04-09 18:17:52 +07:00
|
|
|
int __init get_filesystem_list(char *buf);
|
2007-07-17 18:03:45 +07:00
|
|
|
|
2011-02-02 06:52:46 +07:00
|
|
|
#define __FMODE_EXEC ((__force int) FMODE_EXEC)
|
2011-02-02 06:52:46 +07:00
|
|
|
#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
|
|
|
|
|
2009-12-24 18:58:56 +07:00
|
|
|
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
|
2009-12-18 09:24:25 +07:00
|
|
|
#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
|
2011-02-02 06:52:46 +07:00
|
|
|
(flag & __FMODE_NONOTIFY)))
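Both macros above turn the two low access-mode bits of the open flags into a permission or mode mask. The userspace sketch below reproduces the arithmetic so it can be tried outside the kernel. The EX_ constants repeat the kernel-internal MAY_* and FMODE_* values, and EX_OPEN_FMODE deliberately omits the __FMODE_NONOTIFY pass-through.

```c
#include <fcntl.h>   /* O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_ACCMODE */

/* Kernel-internal values, repeated with an EX_ prefix for userspace. */
#define EX_MAY_WRITE   0x2
#define EX_MAY_READ    0x4
#define EX_FMODE_READ  0x1
#define EX_FMODE_WRITE 0x2

/* Same lookup-table trick as ACC_MODE(): index a 4-byte string. */
#define EX_ACC_MODE(x) ("\004\002\006\006"[(x) & O_ACCMODE])

/* Same arithmetic as OPEN_FMODE(), minus the NONOTIFY bit. */
#define EX_OPEN_FMODE(flag) (((flag) + 1) & O_ACCMODE)

/* Permission mask that an open with `flags` would need to be granted. */
int ex_required_permission(int flags)
{
    return EX_ACC_MODE(flags);
}
```

Adding one before masking works because O_RDONLY, O_WRONLY and O_RDWR are 0, 1 and 2 on Linux, which become the read bit, the write bit, and both.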
|
2009-12-19 22:15:07 +07:00
|
|
|
|
2015-11-19 20:00:12 +07:00
|
|
|
static inline bool is_sxid(umode_t mode)
|
2011-05-28 22:25:51 +07:00
|
|
|
{
|
|
|
|
return (mode & S_ISUID) || ((mode & S_ISGID) && (mode & S_IXGRP));
|
|
|
|
}
|
|
|
|
|
2014-10-24 05:14:36 +07:00
|
|
|
static inline int check_sticky(struct inode *dir, struct inode *inode)
|
|
|
|
{
|
|
|
|
if (!(dir->i_mode & S_ISVTX))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return __check_sticky(dir, inode);
|
|
|
|
}
|
|
|
|
|
2011-05-28 22:25:51 +07:00
|
|
|
static inline void inode_has_no_xattr(struct inode *inode)
|
|
|
|
{
|
2017-07-17 14:45:35 +07:00
|
|
|
if (!is_sxid(inode->i_mode) && (inode->i_sb->s_flags & SB_NOSEC))
|
2011-05-28 22:25:51 +07:00
|
|
|
inode->i_flags |= S_NOSEC;
|
|
|
|
}
|
|
|
|
|
2014-10-22 02:20:42 +07:00
|
|
|
static inline bool is_root_inode(struct inode *inode)
|
|
|
|
{
|
|
|
|
return inode == inode->i_sb->s_root->d_inode;
|
|
|
|
}
|
|
|
|
|
2013-05-16 07:23:06 +07:00
|
|
|
static inline bool dir_emit(struct dir_context *ctx,
|
|
|
|
const char *name, int namelen,
|
|
|
|
u64 ino, unsigned type)
|
|
|
|
{
|
|
|
|
return ctx->actor(ctx, name, namelen, ctx->pos, ino, type) == 0;
|
|
|
|
}
|
|
|
|
static inline bool dir_emit_dot(struct file *file, struct dir_context *ctx)
|
|
|
|
{
|
|
|
|
return ctx->actor(ctx, ".", 1, ctx->pos,
|
|
|
|
file->f_path.dentry->d_inode->i_ino, DT_DIR) == 0;
|
|
|
|
}
|
|
|
|
static inline bool dir_emit_dotdot(struct file *file, struct dir_context *ctx)
|
|
|
|
{
|
|
|
|
return ctx->actor(ctx, "..", 2, ctx->pos,
|
|
|
|
parent_ino(file->f_path.dentry), DT_DIR) == 0;
|
|
|
|
}
|
|
|
|
static inline bool dir_emit_dots(struct file *file, struct dir_context *ctx)
|
|
|
|
{
|
|
|
|
if (ctx->pos == 0) {
|
|
|
|
if (!dir_emit_dot(file, ctx))
|
|
|
|
return false;
|
|
|
|
ctx->pos = 1;
|
|
|
|
}
|
|
|
|
if (ctx->pos == 1) {
|
|
|
|
if (!dir_emit_dotdot(file, ctx))
|
|
|
|
return false;
|
|
|
|
ctx->pos = 2;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
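The helper above is resumable: ctx->pos records how far emission got, so a caller whose buffer filled up can call again without "." being emitted twice. The toy model below imitates that control flow in userspace. The toy_dir_ctx structure and its room and emitted counters are hypothetical and much simpler than struct dir_context.

```c
#include <stdbool.h>
#include <string.h>

struct toy_dir_ctx {
    /* Returns 0 if the entry was accepted, non-zero if the consumer is full. */
    int (*actor)(struct toy_dir_ctx *ctx, const char *name, int namelen);
    long pos;      /* resume position */
    int room;      /* entries the consumer can still accept */
    int emitted;   /* entries accepted so far */
};

/* Toy consumer: accepts entries until `room` runs out. */
int toy_actor(struct toy_dir_ctx *ctx, const char *name, int namelen)
{
    (void)name;
    (void)namelen;
    if (ctx->room == 0)
        return -1;
    ctx->room--;
    ctx->emitted++;
    return 0;
}

bool toy_emit(struct toy_dir_ctx *ctx, const char *name)
{
    return ctx->actor(ctx, name, (int)strlen(name)) == 0;
}

/* Same shape as dir_emit_dots(): positions 0 and 1 are "." and "..". */
bool toy_emit_dots(struct toy_dir_ctx *ctx)
{
    if (ctx->pos == 0) {
        if (!toy_emit(ctx, "."))
            return false;
        ctx->pos = 1;
    }
    if (ctx->pos == 1) {
        if (!toy_emit(ctx, ".."))
            return false;
        ctx->pos = 2;
    }
    return true;
}
```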
|
2013-05-16 08:02:48 +07:00
|
|
|
static inline bool dir_relax(struct inode *inode)
|
|
|
|
{
|
2016-01-23 03:40:57 +07:00
|
|
|
inode_unlock(inode);
|
|
|
|
inode_lock(inode);
|
2013-05-16 08:02:48 +07:00
|
|
|
return !IS_DEADDIR(inode);
|
|
|
|
}
|
2013-05-16 07:23:06 +07:00
|
|
|
|
2016-05-13 07:36:01 +07:00
|
|
|
static inline bool dir_relax_shared(struct inode *inode)
|
|
|
|
{
|
|
|
|
inode_unlock_shared(inode);
|
|
|
|
inode_lock_shared(inode);
|
|
|
|
return !IS_DEADDIR(inode);
|
|
|
|
}
|
|
|
|
|
2015-06-30 02:42:03 +07:00
|
|
|
extern bool path_noexec(const struct path *path);
|
2015-11-17 13:07:57 +07:00
|
|
|
extern void inode_nohighmem(struct inode *inode);
|
2015-06-30 02:42:03 +07:00
|
|
|
|
2018-08-27 19:56:02 +07:00
|
|
|
/* mm/fadvise.c */
|
|
|
|
extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
|
|
|
|
int advice);
|
2019-08-29 23:04:11 +07:00
|
|
|
extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
|
|
|
|
int advice);
|
2018-08-27 19:56:02 +07:00
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring holds indices into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
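The indirection described above can be pictured with a small single-threaded model in which the ring stores slot numbers rather than the descriptors themselves. This is a sketch only: the toy_ names and the eight-entry size are assumptions, the structures do not match the real io_uring layout, and the memory-ordering requirements of a ring that is genuinely shared with the kernel are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES 8U   /* power of two so that masking wraps the ring */

struct toy_sqe {
    uint64_t user_data;
};

struct toy_sq {
    struct toy_sqe sqes[TOY_ENTRIES];   /* request descriptors */
    uint32_t array[TOY_ENTRIES];        /* ring of indices into sqes[] */
    uint32_t head;                      /* consumer position */
    uint32_t tail;                      /* producer position */
};

/* Producer: queue the descriptor in slot sqe_index. 0 on success, -1 if full. */
int toy_sq_push(struct toy_sq *sq, uint32_t sqe_index)
{
    if (sqe_index >= TOY_ENTRIES || sq->tail - sq->head == TOY_ENTRIES)
        return -1;
    sq->array[sq->tail & (TOY_ENTRIES - 1)] = sqe_index;
    sq->tail++;
    return 0;
}

/* Consumer: fetch the next queued descriptor, or NULL if the ring is empty. */
const struct toy_sqe *toy_sq_pop(struct toy_sq *sq)
{
    uint32_t idx;

    if (sq->head == sq->tail)
        return NULL;
    idx = sq->array[sq->head & (TOY_ENTRIES - 1)];
    sq->head++;
    return &sq->sqes[idx];
}
```

Because only indices are queued, slots 5 and 1 can be submitted back to back even though they are not adjacent in sqes[].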
Two new system calls are added for this:
io_uring_setup(entries, params)
    Sets up an io_uring instance for doing async IO. On success,
    returns a file descriptor that the application can mmap to
    gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
    Initiates IO against the rings mapped to this fd, or waits for
    them to complete, or both. The behavior is controlled by the
    parameters passed in. If 'to_submit' is non-zero, then we'll
    try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
    kernel will wait for 'min_complete' events, if they aren't
    already available. It's valid to set IORING_ENTER_GETEVENTS
    and 'min_complete' == 0 at the same time; this allows the
    kernel to return already completed events without waiting
    for them. This is useful only for polling, as for IRQ
    driven IO, the application can just check the CQ ring
    without entering the kernel.
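The waiting rule described for io_uring_enter() can be restated as a small pure function. This is a toy restatement of the prose above, not kernel code: TOY_ENTER_GETEVENTS is a stand-in for IORING_ENTER_GETEVENTS, and toy_events_to_wait_for is a hypothetical helper.

```c
#include <stdint.h>

#define TOY_ENTER_GETEVENTS (1U << 0)   /* stand-in for IORING_ENTER_GETEVENTS */

/*
 * How many further completions a call would have to wait for, given the
 * flags, the requested minimum, and the completions already available.
 */
uint32_t toy_events_to_wait_for(uint32_t flags, uint32_t min_complete,
                                uint32_t already_completed)
{
    if (!(flags & TOY_ENTER_GETEVENTS))
        return 0;   /* submission only, never waits */
    if (already_completed >= min_complete)
        return 0;   /* enough events are already available */
    return min_complete - already_completed;
}
```

With the flag set and min_complete equal to 0 the result is always 0, which corresponds to the "return already completed events without waiting" case.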
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 00:46:33 +07:00
|
|
|
#if defined(CONFIG_IO_URING)
|
|
|
|
extern struct sock *io_uring_get_socket(struct file *file);
|
|
|
|
#else
|
|
|
|
static inline struct sock *io_uring_get_socket(struct file *file)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-07-01 22:25:34 +07:00
|
|
|
int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
|
|
|
|
unsigned int flags);
|
|
|
|
|
2019-07-01 22:25:35 +07:00
|
|
|
int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr *old_fa,
|
|
|
|
struct fsxattr *fa);
|
|
|
|
|
|
|
|
static inline void simple_fill_fsxattr(struct fsxattr *fa, __u32 xflags)
|
|
|
|
{
|
|
|
|
memset(fa, 0, sizeof(*fa));
|
|
|
|
fa->fsx_xflags = xflags;
|
|
|
|
}
|
|
|
|
|
2019-08-20 21:55:16 +07:00
|
|
|
/*
|
|
|
|
* Flush file data before changing attributes. Caller must hold any locks
|
|
|
|
* required to prevent further writes to this file until we're done setting
|
|
|
|
* flags.
|
|
|
|
*/
|
|
|
|
static inline int inode_drain_writes(struct inode *inode)
|
|
|
|
{
|
|
|
|
inode_dio_wait(inode);
|
|
|
|
return filemap_write_and_wait(inode->i_mapping);
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif /* _LINUX_FS_H */
|