[ Upstream commit 0fdec1b3c9fbb5e856a40db5993c9eaf91c74a83 ]
Orangefs df output is whacky. Walt Ligon suggested this might fix it.
It seems way more in line with reality now...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The variable ret is guaranteed to be 0 in this if (). So we can remove
this assignement.
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro pointed out that I broke some acl functionality...
* ACLs could not be fully removed
* posix_acl_chmod would be called while the old ACL was still cached
* new mode propagated to orangefs server before ACL.
... when I tried to make sure that modes that got changed as a
result of ACL-sets would be sent back to the orangefs server.
Not wanting to try and change the code without having some cases to
test it with, I began to hunt for setfacl examples that were expressible
in pure mode. Along the way I found examples like the following
which confused me:
user A had a file (/home/A/asdf) with mode 740
user B was in user A's group
user C was not in user A's group
setfacl -m u:C:rwx /home/A/asdf
The above setfacl caused ls -l /home/A/asdf to show a mode of 770,
making it appear that all users in user A's group now had full access
to /home/A/asdf, however, user B still only had read acces. Madness.
Anywho, I finally found that the above (whacky as it is) appears to
be "posixly on purpose" and explained in acl(5):
If the ACL has an ACL_MASK entry, the group permissions correspond
to the permissions of the ACL_MASK entry.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Conversion: John Hubbard's conversion from get_user_pages() to pin_user_pages()
cleanup: Colin Ian King's removal of an unneeded variable initialization.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl7ao4AACgkQz0QOqevO
Db4MqRAAjb+nRMDA6aYfCY1Veh9BUnVeVPAqWdlSPbo7TdDfhCBJKgwZ8Y3GmOnR
vYnVn6Yz/uhFwoZWmSd0AXgB55kGKyXNcHlyPO4FcaWGDNd/Dn8WWc7/lsqHnDFX
cg0Ioy7VeYS+Y+iW3ZkEjkeGyVFProl/OsJrf9vfJiuyZrLp4th/ctlbV/sIA2R6
XNk0ld21gEB5YbrTCQebtyhdJLp9+hhI0BB2Lxm/JUZyyK4J2rN8H+SF/5JE8wEj
SJD7K5kukxED2Kh3pU1fvRVr0VvHZjLHQav6TgW6GPokmb6EWZwvIYfUa8go50Jz
5fkpyRc8d3zibxDSdL6/Gr6mZQxceFnYvfPs27Vq3O08J/dhWHX0LlKctD4pEHNR
NUsF8Piko+16JwPg+EeXTIsYrMiW8g5FhoTCl7FILQ06F2P6NDCyOUyHiLOU4KaN
+bp6YmIyJBfIBgXz58Pqq2JwibkN6zYiRrb5jDDWv2Nw9ykpjNykrAXH6Fbz8JKV
dPbKkxzx5HQuHBaxYCRKV4r3UNBYKAECrWcLQglG8D+EzvYrBxFZQoPu3qapJPJG
8fQKCKqU8hM6BGYpI76FIK8o/BrOdYr0ttqDsambDWt/uRXRKXwvKJVz3wOB8Q7n
om+XlIk4UE/7vgpTru/wzVX0pmq8WWskbDOKkGqgeUC0BLuZcJw=
=3A+d
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
- John Hubbard's conversion from get_user_pages() to pin_user_pages()
- Colin Ian King's removal of an unneeded variable initialization
* tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: convert get_user_pages() --> pin_user_pages()
orangefs: remove redundant assignment to variable ret
Since the new pair function is introduced, we can call them to clean the
code in orangefs.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Link: http://lkml.kernel.org/r/20200517214718.468-9-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code was using get_user_pages*(), in a "Case 1" scenario
(Direct IO), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.
There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.
[1] Documentation/core-api/pin_user_pages.rst
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The variable ret is being initialized with a value that is
never read and it is being updated later with a new value. The
initialization is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Christoph Hellwig noticed that we were doing some unnecessary
work in orangefs_flush:
orangefs_flush just writes out data on every close(2) call. There is
no need to change anything about the dirty state, especially as
orangefs doesn't treat I_DIRTY_TIMES special in any way. The code
seems to come from partially open coding vfs_fsync.
He sent in a patch with the above commit message and also a
patch that was a reversion of another Orangefs patch I had
sent upstream a while ago. I had to fix his reversion patch
so that it would compile which caused his "don't mess with
I_DIRTY_TIMES" patch to fail to apply. So here I have just
remade his patch and applied it after the fixed reversion patch.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Christoph Hellwig sent in a reversion of "orangefs: remember count
when reading." because:
->read_iter calls can race with each other and one or
more ->flush calls. Remove the the scheme to store the read
count in the file private data as is is completely racy and
can cause use after free or double free conditions
Christoph's reversion caused Orangefs not to work or to compile. I
added a patch that fixed that, but intel's kbuild test robot pointed
out that sending Christoph's patch followed by my patch upstream, it
would break bisection because of the failure to compile. So I have
combined the reversion plus my patch... here's the commit message
that was in my patch:
Logically, optimal Orangefs "pages" are 4 megabytes. Reading
large Orangefs files 4096 bytes at a time is like trying to
kick a dead whale down the beach. Before Christoph's "Revert
orangefs: remember count when reading." I tried to give users
a knob whereby they could, for example, use "count" in
read(2) or bs with dd(1) to get whatever they considered an
appropriate amount of bytes at a time from Orangefs and fill
as many page cache pages as they could at once.
Without the racy code that Christoph reverted Orangefs won't
even compile, much less work. So this replaces the logic that
used the private file data that Christoph reverted with
a static number of bytes to read from Orangefs.
I ran tests like the following to determine what a
reasonable static number of bytes might be:
dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
.
.
.
dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128
Reads seem faster using the static number, so my "knob code"
wasn't just racy, it wasn't even a good idea...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: kbuild test robot <lkp@intel.com>
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.
The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.
When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.
The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.
We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
fix: way back in the stone age (2003) mode was set to the magic
number "755" in what is now fs/orangefs/namei.c(orangefs_symlink).
Łukasz Wrochna reported it and Artur Świgoń sent in a patch to change
it to octal. Maybe it shouldn't be a magic number at all but rather
something like "S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH"...
cleanup: Colin Ian King found a redundant assignment and sent in a
patch to remove it.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdep5RAAoJEM9EDqnrzg2++gsP/AwoACf0h1sLApzxXn+eVcUr
5C454D/tEK5dCS2j3YAOs5O8CXhmV/PHz3foP91d7kdB0WzEWvVS6EZOSCyNCGAk
XOBWkKEk8yOdFCoBfPsUusJbkHjB1d5npfACiodQk25YgqFXlLGX8qD5KtsYX6hK
ANMA8pcsADHHINjfCtEkhmvdrYRnWdYBRXhdpSrv6HoREhmi04+C0loGQi+omm5T
NL+kTBjkUsoENsGoW4hMbv1UlBNkqV1IPFjJfrHPI8FhNG4fjC0rLh+9kxE9dscZ
ugCFfvGcsBQNxluGN/c6d2NYrGLoYhNImwa7Cx3E2kh9nbtPyH6QdvrVZECjl6zJ
1DoDKi/bWHYHLzDciBaBiIU/TY1NeHvEYOTOuqty8LdN89E5Lmh7pMu+xDyFnebL
SLyHf1L5BsxKxyvP97ZcDkiz7XIccIrhQTtF92QihdKJotDW1ONRBvX9ZgrqUQR5
3RTgPv1LY8yXr0263FsHQUv7NsHP49JdeY6Cn1OjujlSJ5j7t9smmO3CDqLQtLHU
6Rh2aEL359GDN6DLTOp4ChgxQO5pJIc7kIgqj+xUrPz8lHewDaYCmmFLoIoOcaMw
L+Z3w9uaGwdonBnl72i18FS9oVrtQoCaDKfEtndJYPE6pgCjle5FOsAIAmkpn3/a
mhWu/jjOQKsumdIjxQu3
=ZJ67
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.4-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"A fix and a cleanup.
The fix: way back in the stone age (2003) mode was set to the magic
number "755" in what is now fs/orangefs/namei.c(orangefs_symlink).
Łukasz Wrochna reported it and Artur Świgoń sent in a patch to change
it to octal. Maybe it shouldn't be a magic number at all but rather
something like "S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH"...
cleanup: Colin Ian King found a redundant assignment and sent in a
patch to remove it"
[ And no, octal numbers for permissions are a lot more legible than a
binary 'or' of some line noise macros. So 0755 is preferred over
trying to spell it out using "helpful" macros - Linus ]
* tag 'for-linus-5.4-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: remove redundant assignment to err
orangefs: Add octal zero prefix
Variable err is initialized to a value that is never read and it
is re-assigned later. The initialization is redundant and can
be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This patch adds a missing zero to mode 755 specification required to
express it in octal numeral system.
Reported-by: Łukasz Wrochna <l.wrochna@samsung.com>
Signed-off-by: Artur Świgoń <a.swigon@partner.samsung.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This file has its own proper style, except that, after a while,
the coding style gets violated and whitespaces are placed on
different ways.
As Sphinx and ReST are very sentitive to whitespace differences,
I had to opt if each entry after required/mandatory/... fields
should start with zero spaces or with a tab. I opted to start them
all from the zero position, in order to avoid needing to break lines
with more than 80 columns, with would make harder for review.
Most of the other changes at porting.rst were made to use an unified
notation with works nice as a text file while also produce a good html
output after being parsed.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
There are 3 remaining files without an extension inside the fs docs
dir.
Manually convert them to ReST.
In the case of the nfs/exporting.rst file, as the nfs docs
aren't ported yet, I opted to convert and add a :orphan: there,
with should be removed when it gets added into a nfs-specific
part of the fs documentation.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Unused Value that colin.king@canonical.com sent me and a
related fix I added.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdLflkAAoJEM9EDqnrzg2+XEIQAImX96e5A/gDSAFhL23cIW73
2mAeCjwvpHk9hmLESpek3NfKH9fBkKS07bv0ICr1P4vSW2KIecbyc6Megern8w6H
WuY0WSELCDy207zqHf1BeGwEMC8Z/MadVSP/2wed7E0+AppcsPeGhe5yFlD8MKYq
sFBoZ3tQ88iujNqlORlPy7bWJzsx3a9MYznDlbeIVY7gsgnFrgyHCoY+cpZTG9fr
ZEQmyw2B7NWwzu438i3Qk+MDiztq4ldwhsO8HFt3+qTBtfUzmb7upWFMM1AUWQxd
3XIYZiCXgPmfIsQQJE5HPeE10zrMNVRYu6CjeUZPl11iy+/OKcY/UVQCwNxbQFAc
ZBFk6OIj9wZsMWAgnlp6zUUJ/s//OudnuSbbHlj3VDVEa+aZSJawgoIsauCyyZc/
5/sDpKcjb5Z4O5Sd+BkT8Zg0iAszBx7aJSqj/ggPvqxt7UQ9dF97AgYdZYfav/br
xL2O9KNfWGHj+qUV8qmQXVMJZWPrHZjQMPqA4Oi5CxmW+eF1kxr2mw3bU7suHXKm
CA7fRECsiyo2ULIl3wT12cOrpuFRQegkUW6AuBq5Z2JNw+H15JCbPZVMllfj7xye
TAxXXp9atcyRpCOrtB3dRO0itnGDrWx4WAxaKR127IeqKAiP2ZsPceu0mILamPno
ekJLcwC8kwbUDrNHR170
=PJuy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.3-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Two small fixes.
This is just a fix for an unused value that Colin King sent me and a
related fix I added"
* tag 'for-linus-5.3-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: eliminate needless variable assignments
orangefs: remove redundant assignment to variable buffer_index
- Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls
(which were the file attribute setters for ext4 and xfs and have now
been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0aJgMACgkQ+H93GTRK
tOuKkg//SJaxcB63uVPZk9hDraYTmyo9OXRRX6X9WwDKPTWwa88CUwS1ny1QF7Mt
zMkgzG2/y2Rs9PQ0ARoPbh1hNb2CXnvA+xnzUEev1MW6UN/nTFMZEOPn2ZQ+DxQE
gg/0U56kKgtjtXzBZVpTgHzSETivdXwHxFW3hiTtyRXg+4ulgDIZLOjN2wRB+Pdb
X8ZmM6MqKOTbhQEXlw13TlCKBzoMjC1w4UU4rkZPjoSjAaUWiPfrk/XU7qgguf9p
v1dbSN2dADQ19jzZ1dmggXnlJsRMZjk/ls5rxJlB5DHDbh6YgnA2TE+tYrtH28eB
uyKfD+RQnMzRVdmH8PsMQRQQFXR2UYyprVP7a6wi6TkB+gytn7sR5uT4sbAhmhcF
TiTYfYNRXzemHCewyOwOsUE/7oCeiJcdbqiPAHHD/jYLZfRjSXDcGzz3+7ZYZ3GO
hRxUhpxHPbkmK4T2OxhzReCbRsLN/0BeEcDdLkNWmi2FTh3V1gYzMGkgI9wsVbsd
pHjoGIHbMPWqktF/obuGq96WVfYBBaWJ6WNzQqKT4dQYAJBW2omxitXQHLpi6cjt
hG5ncxa3cPpWx4t3Lx2hb0TPS7RyYvuoQIcS/Me2RWioxrwWrgnOqdHFfLEwWpfN
jRowdWiGgOIsq8hMt7qycmGCXzbgsbaA/7oRqh8TiwM9taPOM4c=
=uH2E
-----END PGP SIGNATURE-----
Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
"Here's a patch series that sets up common parameter checking functions
for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.
The goal here is to reduce the amount of behaviorial variance between
the filesystems where those ioctls originated (ext2 and XFS,
respectively) and everybody else.
- Standardize parameter checking for the SETFLAGS and FSSETXATTR
ioctls (which were the file attribute setters for ext4 and xfs and
have now been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories"
* tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: only allow FSSETXATTR to set DAX flag on files and dirs
vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
vfs: teach vfs_ioc_fssetxattr_check to check project id info
vfs: create a generic checking function for FS_IOC_FSSETXATTR
vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
The variable buffer_index is being initialized however this is never
read and later it is being reassigned to a new value. The initialization
is redundant and hence can be removed.
Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Stephen writes:
After merging the driver-core tree, today's linux-next build (x86_64
allmodconfig) produced this warning:
fs/orangefs/orangefs-debugfs.c: In function 'orangefs_debugfs_init':
fs/orangefs/orangefs-debugfs.c:193:1: warning: label 'out' defined but not used [-Wunused-label]
out:
^~~
fs/orangefs/orangefs-debugfs.c: In function 'orangefs_kernel_debug_init':
fs/orangefs/orangefs-debugfs.c:204:17: warning: unused variable 'ret' [-Wunused-variable]
struct dentry *ret;
^~~
Fix this up and change the return type of the function to void as it can
not fail, which cleans up some more code and variables as well.
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: f095adba36 ("orangefs: no need to check return value of debugfs_create functions")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612152204.GA17511@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.
Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.
This patch does not change any functionality. New functionality will
follow in subsequent patches.
Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.
NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted. This breaks the current
GUP call site convention of having the returned pages be the final
parameter. So the suggestion was rejected.
Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
the pagecache" patch series which greatly improves our small IO
performance and helps us pass more xfstests than before.
fix: orangefs: truncate before updating size
Pagecache series: all the rest
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJc1CucAAoJEM9EDqnrzg2+2h4P/i2ktG4Zi3VDwCr1xea0bWSf
kidYxzVPmGAmfC+UNMn/+X4v/R6rtHcpW/g7/0WLqhLUhRwjB1gzT6U9mHfu8BQV
1I7tg6lQqcTi4gNkBMWI4MHKRyoQ5bAgY7jFxLjqx1zj5+qxrtPn4GUCUbDBDPpP
Ip9RD/izEKeG77gdSUkCD72ojHl9lqmT1Qd3k3asmozzdWzd8mdXy8EQplECo7tZ
1tLzV+BSER2BVQfmws2CCG00DLFzQRJkDYxm7I5l1xoQ/Ft8n6k/tXZ1QzkPREg2
FhaoY5eCA/2MkB0VhUKj1Y+gXNZcbys11oxcqyZZ6p2kOu+1bY+evkzwDUzsIAg+
nl1d3LoeEVKoxpa4O+vCHLF++HX8TuVHSaxvNaZPMm86gxowCh0aberyWcscGbQp
m2M28WdjlzYC7uPSakoKVvF+ZBNKQP5sn6jbjYSOHAB9nFz6x5ILH3SMK8MXoZZe
2Ejd4NBmtqekdLp2MNlAWHF1O9sTagH+OqUvbNRNjrQXM3hNSgeguGx1b0E+KlT2
lwKORIVwZNZ/xuizdwM01kWRQObAmGrptlVbR4+HKqo99g9gGv6sdw//cHzo9/2S
GV34cpS5VRSwHVwuzADD1A/qQ+k5xomihrpZwcqh+YOwQV16JnTHiSeTYZKVIEW8
cBiZ2k5dbKm1QcLc9tJ3
=uBS+
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"This includes one fix and our "Orangefs through the pagecache" patch
series which greatly improves our small IO performance and helps us
pass more xfstests than before.
Fix:
- orangefs: truncate before updating size
Pagecache series:
- all the rest"
* tag 'for-linus-5.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (23 commits)
orangefs: truncate before updating size
orangefs: copy Orangefs-sized blocks into the pagecache if possible.
orangefs: pass slot index back to readpage.
orangefs: remember count when reading.
orangefs: add orangefs_revalidate_mapping
orangefs: implement writepages
orangefs: write range tracking
orangefs: avoid fsync service operation on flush
orangefs: skip inode writeout if nothing to write
orangefs: move do_readv_writev to direct_IO
orangefs: do not return successful read when the client-core disappeared
orangefs: implement writepage
orangefs: migrate to generic_file_read_iter
orangefs: service ops done for writeback are not killable
orangefs: remove orangefs_readpages
orangefs: reorganize setattr functions to track attribute changes
orangefs: let setattr write to cached inode
orangefs: set up and use backing_dev_info
orangefs: hold i_lock during inode_getattr
orangefs: update attributes rather than relying on server
...
Otherwise we race with orangefs_writepage/orangefs_writepages
which and does not expect i_size < page_offset.
Fixes xfstests generic/129.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
->readpage looks in file->private_data to try and find out how the
userspace program set "count" in read(2) or with "dd bs=" or whatever.
->readpage uses "count" and inode->i_size to calculate how much
data Orangefs should deposit in the Orangefs shared buffer, and
remembers which slot the data is in.
After copying data from the Orangefs shared buffer slot into
"the page", readpage tries to increment through the pagecache index
and fill as many pages as it can from the extra data in the shared
buffer. Hopefully these extra pages will soon be needed by the vfs,
and they'll be in the pagecache already.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
When userspace deposits more than a page of data into the shared buffer,
we'll need to know which slot it is in when we get back to readpage
so that we can try to use the extra data to fill some extra pages.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
and looses when it has to do tiny "small io" reads and writes. Accessing
Orangefs through the pagecache with the kernel module helps with small io,
both reading and writing, a great deal. Readpage generally tries to fetch a
page (four k) at a time. We'll let users use "count" (as in read(2) or
pread(2) for example) as a knob to control how much data they get from
Orangefs at a time and we'll try to use the data to fill extra
pagecache pages when we get to ->readpage, hopefully resulting in
fewer calls to readpage and Orangefs userspace.
We need a way to remember how they set count so that we can still have
it available when we get to ->readpage.
- We'll use file->private_data to keep track of "count".
We'll wrap generic_file_open with orangefs_file_open and
initialize private_data to NULL there.
- In ->read_iter we have access to both "count" and file, so
we'll kmalloc some space onto file->private_data and store
"count" there.
- We'll kfree file->private_data each time we visit ->flush and
reinitialize it to NULL.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
This is modeled after NFS, except our method is different. We use a
simple timer to determine whether to invalidate the page cache. This
is bound to perform.
This addes a sysfs parameter cache_timeout_msecs which controls the time
between page cache invalidations.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Go through pages and look for a consecutive writable region. After
finding a number of consecutive writable pages or when finding that
the next page's dirty range is not contiguous and cannot be written
as one request, send the write to the server.
The number of pages is determined by the client-core's buffer size.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Attach the actual range of bytes written to plus the responsible uid/gid
to each dirty page. This information must be sent to the server when
the page is written out.
Now write_begin, page_mkwrite, and invalidatepage keep up with this
information. There are several conditions where they must write out the
page immediately to store the new range. Two non-contiguous ranges
cannot be stored on a single page.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Without this, an fsync call is sent to the server even if no data
changed. This resulted in a rather severe (50%) performance regression
under certain metadata-heavy workloads.
In the past, everything was direct IO. Nothing happend on a close call.
An explicit fsync call would send an fsync request to the server which
in turn fsynced the underlying file.
Now there are cached writes. Then fsync began writing out dirty pages
in addition to making an fsync request to the server, and close began
calling fsync.
With this commit, close only writes out dirty pages, and does not make
the fsync request.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Would happen if an inode is dirty but whatever happened is not something
that can be written out to OrangeFS.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
direct_IO was the only caller and all direct_IO did was call it,
so there's no use in having the code spread out into so many functions.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Now orangefs_inode_getattr fills from cache if an inode has dirty pages.
also if attr_valid and dirty pages and !flags, we spin on inode writeback
before returning if pages still dirty after: should it be other way
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Remove orangefs_inode_read. It was used by readpage. Calling
wait_for_direct_io directly serves the purpose just as well. There is
now no check of the bufmap size in the readpage path. There are already
other places the bufmap size is assumed to be greater than PAGE_SIZE.
Important to call truncate_inode_pages now in the write path so a
subsequent read sees the new data.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It's a copy of the loop which would run in read_pages from
mm/readahead.c.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
OrangeFS accepts a mask indicating which attributes were changed. The
kernel must not set any bits except those that were actually changed.
The kernel must set the uid/gid of the request to the actual uid/gid
responsible for the change.
Code path for notify_change initiated setattrs is
orangefs_setattr(dentry, iattr)
-> __orangefs_setattr(inode, iattr)
In kernel changes are initiated by calling __orangefs_setattr.
Code path for writeback is
orangefs_write_inode
-> orangefs_inode_setattr
attr_valid and attr_uid and attr_gid change together under i_lock.
I_DIRTY changes separately.
__orangefs_setattr
lock
if needs to be cleaned first, unlock and retry
set attr_valid
copy data in
unlock
mark_inode_dirty
orangefs_inode_setattr
lock
copy attributes out
unlock
clear getattr_time
# __writeback_single_inode clears dirty
orangefs_inode_getattr
# possible to get here with attr_valid set and not dirty
lock
if getattr_time ok or attr_valid set, unlock and return
unlock
do server operation
# another thread may getattr or setattr, so check for that
lock
if getattr_time ok or attr_valid, unlock and return
else, copy in
update getattr_time
unlock
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This is a fairly big change, but ultimately it's not a lot of code.
Implement write_inode and then avoid the call to orangefs_inode_setattr
within orangefs_setattr.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This should be a no-op now. When inode writeback works, this will
prevent a getattr from overwriting inode data while an inode is
transitioning to dirty.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This should be a no-op now, but once inode writeback works, it'll be
necessary to have the correct attribute in the dirty inode.
Previously the attribute fetch timeout was marked invalid and the server
provided the updated attribute. When the inode is dirty, the server
cannot be consulted since it does not yet know the pending setattr.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
No need to store the received mask. It is either STATX_BASIC_STATS or
STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache
is bypassed anyway, so the cached mask is unnecessary to decide whether
to do a real getattr.
This is a change. Previously a getattr would want size and use the
cached size. All of the in-kernel callers that wanted size did not want
a cached size. Now a getattr cannot use the cached size if it wants
size at all.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When an inode is created, we fetch attributes from the server. There is
no need to turn around and invalidate them.
No need to initialize attributes after the getattr either. Either it'll
be exactly the same, or it'll be something else and wrong.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This uses the same timeout as the getattr cache. This substantially
increases performance when writing files with smaller buffer sizes.
When writing, the size is (often) changed, which causes a call to
notify_change which calls security_inode_need_killpriv which needs a
getxattr. Caching it reduces traffic to the server.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>