Commit Graph

957 Commits

Author SHA1 Message Date
Jan Kara
82a25b027c ext4: wait for outstanding dio during truncate in nojournal mode
We didn't wait for outstanding direct IO during truncate in nojournal
mode (as we skip orphan handling in that case). This can lead to fs
corruption or stale data exposure if truncate ends up freeing blocks
and these get reallocated before direct IO finishes. Fix the condition
determining whether the wait is necessary.

CC: stable@vger.kernel.org
Fixes: 1c9114f9c0 ("ext4: serialize unlocked dio reads with truncate")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-05-23 23:07:08 -04:00
Gabriel Krisman Bertazi
b886ee3e77 ext4: Support case-insensitive file name lookups
This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string.  This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children.  This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries.  This is bad for performance but requires some leveraging of
the vfs layer to fix.  We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware.  The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file.  This means that case-insensitive
file name lookup will not work only for that file.  An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding.  When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI.  It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

  - It has a lower cost than NFC/NFKC (which requires
    decomposing to NFD as an intermediary step)
  - It doesn't eliminate important semantic meaning like
    compatibility decompositions.

Although:

  - This implementation is not completely linguistic accurate, because
  different languages have conflicting rules, which would require the
  specialization of the filesystem to a given locale, which brings all
  sorts of problems for removable media and for users who use more than
  one language.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 14:12:08 -04:00
Theodore Ts'o
345c0dbf3a ext4: protect journal inode's blocks using block_validity
Add the blocks which belong to the journal inode to block_validity's
system zone so attempts to deallocate or overwrite the journal due a
corrupted file system where the journal blocks are also claimed by
another inode.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202879
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-04-09 23:37:08 -04:00
Arnd Bergmann
1e83bc8156 ext4: use BUG() instead of BUG_ON(1)
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:

 fs/ext4/inode.c:544:4: error: variable 'retval' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                        BUG_ON(1);
                        ^~~~~~~~~
 include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
                                   ^~~~~~~~~~~~~~~~~~~
 include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 fs/ext4/inode.c:591:6: note: uninitialized use occurs here
        if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
            ^~~~~~
 fs/ext4/inode.c:544:4: note: remove the 'if' if its condition is always true
                        BUG_ON(1);
                        ^
 include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
                               ^
 fs/ext4/inode.c:502:12: note: initialize the variable 'retval' to silence this warning

Change it to BUG() so clang can see that this code path can never
continue.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-04-07 12:24:43 -04:00
Linus Torvalds
17403fa277 Miscellaneous ext4 bug fixes for 5.1.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlyX1hcACgkQ8vlZVpUN
 gaOkAAf8Cx3fIkMhBQ6iZABkE8om8LCw0cd1eBi06ode9mV/GQK3mSWrfTmaPzPn
 3OysYuBUmKSB4Bbzq8pty6yujlUOhTVFIL+wL3GCbRfYhS2a4u9hCJ5jyVsXCo+M
 IPIxGlgnDiYEppj3Ok3r6HgijSzLRw1jIgSCLW5N22aYikpTETWekFDNMoAHraYF
 ew6VRATCMhqa5CTWjEl+v9dnQ13jPzcWSW8BiHQCHBMV98NFciszPlnBHH45PUzF
 90ktCxYySopfpaegECAsRp5TgvfK32ZnxlbcRgTjaj5P7u+KkGOPp7DGLiyS3Ih7
 fK/MFaWytT7/IVOGlC07rpmYq7T2ww==
 =9gOf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Miscellaneous ext4 bug fixes for 5.1"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: prohibit fstrim in norecovery mode
  ext4: cleanup bh release code in ext4_ind_remove_space()
  ext4: brelse all indirect buffer in ext4_ind_remove_space()
  ext4: report real fs size after failed resize
  ext4: add missing brelse() in add_new_gdb_meta_bg()
  ext4: remove useless ext4_pin_inode()
  ext4: avoid panic during forced reboot
  ext4: fix data corruption caused by unaligned direct AIO
  ext4: fix NULL pointer dereference while journal is aborted
2019-03-24 13:41:37 -07:00
Jason Yan
7cf7714077 ext4: remove useless ext4_pin_inode()
This function is never used from the beginning (and is commented out);
let's remove it.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-03-14 23:51:13 -04:00
Linus Torvalds
a5adcfcad5 A large number of bug fixes and cleanups. One new feature to allow
users to more easily find the jbd2 journal thread for a particular
 ext4 file system.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlx8utQACgkQ8vlZVpUN
 gaOMOQf+Olp6hTbCuPJNill7npEejkPu9VhNvLPp3dLPBfsyqG9IOZmUaKKtr3LS
 ZYYzMMoIlbHDsWM70O92zDS3s1ThKRFoDdcw4YKXkn1Awlqc4LRZ/NnzyIIdA3mK
 rhOvcr6ttWk2B2S67nGceTH08SX5zACMtMiQijP58+GCp4Xe+PdqPRRjYYJSOZMv
 xCS43LoWY0tkeBTQuk9WYTi6G/E1X/aiq06pYiQzP69PotN6/cFSdNgP1r+7dYiS
 V4IXPqEqFt8NvUZb1bJchT3+2zM3Xi/+n//7yLkpY7OhX6p1p24oB7abMstp3ssU
 BlF8KP4elQcI892QX2Hf+0r4tBu+0w==
 =2yLu
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A large number of bug fixes and cleanups.

  One new feature to allow users to more easily find the jbd2 journal
  thread for a particular ext4 file system"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
  jbd2: jbd2_get_transaction does not need to return a value
  jbd2: fix invalid descriptor block checksum
  ext4: fix bigalloc cluster freeing when hole punching under load
  ext4: add sysfs attr /sys/fs/ext4/<disk>/journal_task
  ext4: Change debugging support help prefix from EXT4 to Ext4
  ext4: fix compile error when using BUFFER_TRACE
  jbd2: fix compile warning when using JBUFFER_TRACE
  ext4: fix some error pointer dereferences
  ext4: annotate more implicit fall throughs
  ext4: annotate implicit fall throughs
  ext4: don't update s_rev_level if not required
  jbd2: fold jbd2_superblock_csum_{verify,set} into their callers
  jbd2: fix race when writing superblock
  ext4: fix crash during online resizing
  ext4: disallow files with EXT4_JOURNAL_DATA_FL from EXT4_IOC_SWAP_BOOT
  ext4: add mask of ext4 flags to swap
  ext4: update quota information while swapping boot loader inode
  ext4: cleanup pagecache before swap i_data
  ext4: fix check of inode in swap_inode_boot_loader
  ext4: unlock unused_pages timely when doing writeback
  ...
2019-03-12 15:03:21 -07:00
zhangyi (F)
ddccb6dbe7 ext4: fix compile error when using BUFFER_TRACE
Fix compile error below when using BUFFER_TRACE.

fs/ext4/inode.c: In function ‘ext4_expand_extra_isize’:
fs/ext4/inode.c:5979:19: error: request for member ‘bh’ in something not a structure or union
  BUFFER_TRACE(iloc.bh, "get_write_access");

Fixes: c03b45b853 ("ext4, project: expand inode extra size if possible")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-21 11:29:10 -05:00
Andreas Dilger
c9e716eb9b ext4: don't update s_rev_level if not required
Don't update the superblock s_rev_level during mount if it isn't
actually necessary, only if superblock features are being set by
the kernel.  This was originally added for ext3 since it always
set the INCOMPAT_RECOVER and HAS_JOURNAL features during mount,
but this is not needed since no journal mode was added to ext4.

That will allow Geert to mount his 20-year-old ext2 rev 0.0 m68k
filesystem, as a testament of the backward compatibility of ext4.

Fixes: 0390131ba8 ("ext4: Allow ext4 to run without a journal")
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-14 17:52:18 -05:00
Xiaoguang Wang
a297b2fcee ext4: unlock unused_pages timely when doing writeback
In mpage_add_bh_to_extent(), when accumulated extents length is greater
than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
will not continue to search unmapped area for this page, but note this
page is locked, and will only be unlocked in mpage_release_unused_pages()
after ext4_io_submit, if io also is throttled by blk-throttle or similar
io qos, we will hold this page locked for a while, it's unnecessary.

I think the best fix is to refactor mpage_add_bh_to_extent() to let it
return some hints whether to unlock this page, but given that we will
improve dioread_nolock later, we can let it done later, so currently
the simple fix would just call mpage_release_unused_pages() before
ext4_io_submit().

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-10 23:53:21 -05:00
zhangyi (F)
16e08b14a4 ext4: cleanup clean_bdev_aliases() calls
Now, we have already handle all cases of forgetting buffer in
jbd2_journal_forget(), the buffer should not be mapped to blockdevice
when reallocating it. So this patch remove all clean_bdev_aliases() and
clean_bdev_bh_alias() calls which were invoked by ext4 explicitly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:32:07 -05:00
Nikolay Borisov
82dd124c40 ext4: replace opencoded i_writecount usage with inode_is_open_for_write()
There is a function which clearly conveys the objective of checking
i_writecount. Additionally the usage in ext4_mb_initialize_context was
wrong, since a node would have wrongfully been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:04:16 -05:00
Chandan Rajendra
643fa9612b fscrypt: remove filesystem specific build config option
In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Chandan Rajendra
592ddec757 ext4: use IS_ENCRYPTED() to check encryption status
This commit removes the ext4 specific ext4_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Theodore Ts'o
191ce17876 ext4: fix special inode number checks in __ext4_iget()
The check for special (reserved) inode number checks in __ext4_iget()
was broken by commit 8a363970d1: ("ext4: avoid declaring fs
inconsistent due to invalid file handles").  This was caused by a
botched reversal of the sense of the flag now known as
EXT4_IGET_SPECIAL (when it was previously named EXT4_IGET_NORMAL).
Fix the logic appropriately.

Fixes: 8a363970d1 ("ext4: avoid declaring fs inconsistent...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
2018-12-31 22:34:31 -05:00
Theodore Ts'o
812c0cab2c ext4: make sure enough credits are reserved for dioread_nolock writes
There are enough credits reserved for most dioread_nolock writes;
however, if the extent tree is sufficiently deep, and/or quota is
enabled, the code was not allowing for all eventualities when
reserving journal credits for the unwritten extent conversion.

This problem can be seen using xfstests ext4/034:

   WARNING: CPU: 1 PID: 257 at fs/ext4/ext4_jbd2.c:271 __ext4_handle_dirty_metadata+0x10c/0x180
   Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
   RIP: 0010:__ext4_handle_dirty_metadata+0x10c/0x180
   	...
   EXT4-fs: ext4_free_blocks:4938: aborting transaction: error 28 in __ext4_handle_dirty_metadata
   EXT4: jbd2_journal_dirty_metadata failed: handle type 11 started at line 4921, credits 4/0, errcode -28
   EXT4-fs error (device dm-1) in ext4_free_blocks:4950: error 28

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-24 20:27:08 -05:00
Theodore Ts'o
18f2c4fceb ext4: check for shutdown and r/o file system in ext4_write_inode()
If the file system has been shut down or is read-only, then
ext4_write_inode() needs to bail out early.

Also use jbd2_complete_transaction() instead of ext4_force_commit() so
we only force a commit if it is needed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 14:36:58 -05:00
Theodore Ts'o
8a363970d1 ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.

This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.

#define _GNU_SOURCE
#include <fcntl.h>

struct handle {
	struct file_handle fh;
	unsigned char fid[MAX_HANDLE_SZ];
};

int main(int argc, char **argv)
{
	struct handle h = {{8, 1 }, { 12, }};

	open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
	return 0;
}

Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 12:29:13 -05:00
Linus Torvalds
c140f8b072 A large number of ext4 bug fixes, mostly buffer and memory leaks on
error return cleanup paths.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvoFrEACgkQ8vlZVpUN
 gaMTSQf+Ogrvm7pfWtXf+RkmhhuyR26T+Hwxgl51m5bKetJBjEsh0qOaIfo7etwG
 aLc1x/pWng2VTCHk4z0Ij9KS8YwLK3sQCBYZoJFyT/R09yGgAhLm+xP5j38WLqrX
 h4GxVgekHSATkG95N/So7F7pQiz7gDowgbaYFW3PooXPoHJnCnTzcr7TGFAQBZAw
 iR+8+KtH5E8IcC7Jj40nemk7Wib45DgaeGpP5P9Ct/Jw7hW+Mwhf56NYOWkLdHyy
 4Kt7rm1Sbxam8k3nksNmIwx28bw+S0Ew1zZgkwgAcKcHaWdrv3TtGPkOA26AH+S3
 UVeORM7xH+zXslIOyFK+7sXUZr5LiQ==
 =BaBl
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A large number of ext4 bug fixes, mostly buffer and memory leaks on
  error return cleanup paths"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: missing !bh check in ext4_xattr_inode_write()
  ext4: fix buffer leak in __ext4_read_dirblock() on error path
  ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
  ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
  ext4: release bs.bh before re-using in ext4_xattr_block_find()
  ext4: fix buffer leak in ext4_xattr_get_block() on error path
  ext4: fix possible leak of s_journal_flag_rwsem in error path
  ext4: fix possible leak of sbi->s_group_desc_leak in error path
  ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
  ext4: avoid possible double brelse() in add_new_gdb() on error path
  ext4: avoid buffer leak in ext4_orphan_add() after prior errors
  ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
  ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
  ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
  ext4: add missing brelse() update_backups()'s error path
  ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
  ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
  ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
2018-11-11 16:53:02 -06:00
Vasily Averin
a6758309a0 ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
even if it returns an error.

Fixes: 0db1ff222d ("ext4: add shutdown bit and check for it")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.11
2018-11-06 16:49:50 -05:00
Linus Torvalds
dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Matthew Wilcox
10bbd23585 pagevec: Use xa_mark_t
Removes sparse warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Souptick Joarder
401b25aa1a ext4: convert fault handler to use vm_fault_t type
Return type of ext4_page_mkwrite and ext4_filemap_fault are
changed to use vm_fault_t type.

With this patch all the callers of block_page_mkwrite_return()
are changed to handle vm_fault_t. So converting the return type
of block_page_mkwrite_return() to vm_fault_t.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
2018-10-02 22:20:50 -04:00
Eric Whitney
f456767d33 ext4: fix reserved cluster accounting at page invalidation time
Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation.  This replaces old code in
ext4_da_page_release_reservations that was incorrect.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:33:24 -04:00
Eric Whitney
0b02f4c0d6 ext4: fix reserved cluster accounting at delayed write time
The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:19:37 -04:00
Eric Whitney
ad431025ae ext4: generalize extents status tree search functions
Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:10:39 -04:00
Greg Kroah-Hartman
ad3273d5f1 Various ext4 bug fixes; primarily making ext4 more robust against
maliciously crafted file systems, and some DAX fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlufGncACgkQ8vlZVpUN
 gaPwuQf9FKp9yRvjBkjtnH3+s4Ps8do9r067+90y1k2DJMxKoaBUhGSW2MJJ04j+
 5F6Ndp/TZHw+LfPnzsqlrAAoP3CG5+kacfJ7xeVKR0umvACm6rLMsCUct7/rFoSl
 PgzCALFIJvQ9+9shuO9qrgmjJrfrlTVUgR9Mu3WUNEvMFbMjk3FMI8gi5kjjWemE
 G9TDYH2lMH2sL0cWF51I2gOyNXOXrihxe+vP7j6i/rUkV+YLpKZhE1ss3Sfn6pR2
 p/KjnXdupLJpgYLJne9kMrq2r8xYmDfA0S+Dec7nkox5FUOFUHssl3+q8C7cDwO9
 zl6VyVFwybjFRJ/Y59wox6eqVPlIWw==
 =1P1w
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Ted writes:
	Various ext4 bug fixes; primarily making ext4 more robust against
	maliciously crafted file systems, and some DAX fixes.

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, dax: set ext4_dax_aops for dax files
  ext4, dax: add ext4_bmap to ext4_dax_aops
  ext4: don't mark mmp buffer head dirty
  ext4: show test_dummy_encryption mount option in /proc/mounts
  ext4: close race between direct IO and ext4_break_layouts()
  ext4: fix online resizing for bigalloc file systems with a 1k block size
  ext4: fix online resize's handling of a too-small final block group
  ext4: recalucate superblock checksum after updating free blocks/inodes
  ext4: avoid arithemetic overflow that can trigger a BUG
  ext4: avoid divide by zero fault when deleting corrupted inline directories
  ext4: check to make sure the rename(2)'s destination is not freed
  ext4: add nonstring annotations to ext4.h
2018-09-17 09:13:47 +02:00
Toshi Kani
cce6c9f7e6 ext4, dax: set ext4_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext4_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext4_dax_aops', since S_DAX flag is set after
ext4_set_aops() in the open path.

  New file
  --------
  lookup_open
    ext4_create
      __ext4_new_inode
        ext4_set_inode_flags   // Set S_DAX flag
      ext4_set_aops            // Set aops to ext4_dax_aops

  Existing file
  -------------
  lookup_open
    ext4_lookup
      ext4_iget
        ext4_set_aops          // Set aops to ext4_da_aops
        ext4_set_inode_flags   // Set S_DAX flag

Change ext4_iget() to initialize i_flags before ext4_set_aops().

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:37:59 -04:00
Toshi Kani
94dbb63117 ext4, dax: add ext4_bmap to ext4_dax_aops
Ext4 mount path calls .bmap to the journal inode. This currently
works for the DAX mount case because ext4_iget() always set
'ext4_da_aops' to any regular files.

In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4
DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for
DAX inodes.

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:23:41 -04:00
Ross Zwisler
b1f382178d ext4: close race between direct IO and ext4_break_layouts()
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-11 13:31:16 -04:00
Theodore Ts'o
bcd8e91f98 ext4: avoid arithemetic overflow that can trigger a BUG
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.

https://bugzilla.kernel.org/show_bug.cgi?id=200623

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-09-01 12:45:04 -04:00
Jens Axboe
ac22b46a0b ext4: readpages() should submit IO as read-ahead
a_ops->readpages() is only ever used for read-ahead.  Ensure that we
pass this information down to the block layer.

Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Liu Song
bc71652346 ext4: improve code readability in ext4_iget()
Merge the duplicated complex conditions to improve code readability.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
2018-08-02 00:11:16 -04:00
Ross Zwisler
430657b6be ext4: handle layout changes to pinned DAX mappings
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2018-07-29 17:00:22 -04:00
Arnd Bergmann
5ffff83432 ext4: use ktime_get_real_seconds for i_dtime
We only care about the low 32-bit for i_dtime as explained in commit
b5f515735b ("ext4: avoid Y2038 overflow in recently_deleted()"), so
the use of get_seconds() is correct here, but that function is getting
removed in the process of the y2038 fixes, so let's use the modern
ktime_get_real_seconds() here.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:50:00 -04:00
Theodore Ts'o
362eca70b5 ext4: fix inline data updates with checksums enabled
The inline data code was updating the raw inode directly; this is
problematic since if metadata checksums are enabled,
ext4_mark_inode_dirty() must be called to update the inode's checksum.
In addition, the jbd2 layer requires that get_write_access() be called
before the metadata buffer is modified.  Fix both of these problems.

https://bugzilla.kernel.org/show_bug.cgi?id=200443

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-07-10 01:07:43 -04:00
Theodore Ts'o
c37e9e0134 ext4: add more inode number paranoia checks
If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.

Also, if the superblock's first inode number field is too small,
refuse to mount the file system.

This addresses CVE-2018-10882.

https://bugzilla.kernel.org/show_bug.cgi?id=200069

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-17 00:41:14 -04:00
Theodore Ts'o
bdbd6ce01a ext4: include the illegal physical block in the bad map ext4_error msg
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-15 12:27:16 -04:00
Theodore Ts'o
eb9b5f01c3 ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget()
If ext4_find_inline_data_nolock() returns an error it needs to get
reflected up to ext4_iget().  In order to fix this,
ext4_iget_extra_inode() needs to return an error (and not return
void).

This is related to "ext4: do not allow external inodes for inline
data" (which fixes CVE-2018-11412) in that in the errors=continue
case, it would be useful to for userspace to receive an error
indicating that file system is corrupted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
2018-05-22 17:14:07 -04:00
Lukas Czerner
eee597ac93 ext4: update mtime in ext4_punch_hole even if no blocks are released
Currently in ext4_punch_hole we're going to skip the mtime update if
there are no actual blocks to release. However we've actually modified
the file by zeroing the partial block so the mtime should be updated.

Moreover the sync and datasync handling is skipped as well, which is
also wrong. Fix it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Joe Habermann <joe.habermann@quantum.com>
Cc: <stable@vger.kernel.org>
2018-05-13 19:28:35 -04:00
Luis R. Rodriguez
6390d33bf5 ext4: add verifier check for symlink with append/immutable flags
The Linux VFS does not allow a way to set append/immuttable
attributes to symlinks, this is just not possible. If this is
detected inform the user as the filesystem must be corrupted.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-05-13 16:45:56 -04:00
Eryu Guan
e254d1afac ext4: use raw i_version value for ea_inode
Currently, creating large xattr (e.g. 2k) in ea_inode would cause
ea_inode refcount corruption, e.g.

  Pass 4: Checking reference counts
  Extended attribute inode 13 ref count is 0, should be 1. Fix? no

This is because that we save the lower 32bit of refcount in
inode->i_version and store it in raw_inode->i_disk_version on disk.
But since commit ee73f9a52a ("ext4: convert to new i_version
API"), we load/store modified i_disk_version from/to disk instead of
raw value, which causes on-disk ea_inode refcount corruption.

Fix it by loading/storing raw i_version/i_disk_version, because it's
a self-managed value in this case.

Fixes: ee73f9a52a ("ext4: convert to new i_version API")
Cc: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-10 11:55:31 -04:00
Linus Torvalds
9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Linus Torvalds
9022ca6b11 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06 11:07:08 -07:00
Dan Williams
5f0663bb4a ext4, dax: introduce ext4_dax_aops
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-30 11:34:55 -07:00
Theodore Ts'o
8e4b5eae5d ext4: fail ext4_iget for root directory if unallocated
If the root directory has an i_links_count of zero, then when the file
system is mounted, then when ext4_fill_super() notices the problem and
tries to call iput() the root directory in the error return path,
ext4_evict_inode() will try to free the inode on disk, before all of
the file system structures are set up, and this will result in an OOPS
caused by a NULL pointer dereference.

This issue has been assigned CVE-2018-1092.

https://bugzilla.kernel.org/show_bug.cgi?id=199179
https://bugzilla.redhat.com/show_bug.cgi?id=1560777

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-03-29 21:56:09 -04:00
Christoph Hellwig
0e11f6443f fs: move I_DIRTY_INODE to fs.h
And use it in a few more places rather than opencoding the values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:02 -04:00
Goldwyn Rodrigues
043d20d159 ext4: use generic_writepages instead of __writepage/write_cache_pages
Code cleanup. Instead of writing an internal static function, use the
available generic_writepages().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-26 01:32:50 -04:00
Nikolay Borisov
1d39834fba ext4: remove EXT4_STATE_DIOREAD_LOCK flag
Commit 16c5468859 ("ext4: Allow parallel DIO reads") reworked the way
locking happens around parallel dio reads. This resulted in obviating
the need for EXT4_STATE_DIOREAD_LOCK flag and accompanying logic.
Currently this amounts to dead code so let's remove it. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-03-22 11:52:10 -04:00
Jiri Slaby
fe23cb65c2 ext4: fix offset overflow on 32-bit archs in ext4_iomap_begin()
ext4_iomap_begin() has a bug where offset returned in the iomap
structure will be truncated to unsigned long size. On 64-bit
architectures this is fine but on 32-bit architectures obviously not.
Not many places actually use the offset stored in the iomap structure
but one of visible failures is in SEEK_HOLE / SEEK_DATA implementation.
If we create a file like:

dd if=/dev/urandom of=file bs=1k seek=8m count=1

then

lseek64("file", 0x100000000ULL, SEEK_DATA)

wrongly returns 0x100000000 on unfixed kernel while it should return
0x200000000. Avoid the overflow by proper type cast.

Fixes: 545052e9e3 ("ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # v4.15
2018-03-22 11:50:26 -04:00
Eryu Guan
45d8ec4d9f ext4: update i_disksize if direct write past ondisk size
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.

Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.

Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.

This fixes occasional i_size corruption in fstests generic/475.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-22 11:44:59 -04:00
Eryu Guan
73fdad00b2 ext4: protect i_disksize update by i_data_sem in direct write path
i_disksize update should be protected by i_data_sem, by either taking
the lock explicitly or by using ext4_update_i_disksize() helper. But the
i_disksize updates in ext4_direct_IO_write() are not protected at all,
which may be racing with i_disksize updates in writeback path in
delalloc buffer write path.

This is found by code inspection, and I didn't hit any i_disksize
corruption due to this bug. Thanks to Jan Kara for catching this bug and
suggesting the fix!

Reported-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-03-22 11:41:25 -04:00
Linus Torvalds
23aedc4b9b Only miscellaneous cleanups and bug fixes for ext4 this cycle.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp16xMACgkQ8vlZVpUN
 gaP1IAf8C48AKVnqy6ftFphzV1CdeGHDwJLL63lChs97fNr1mxo5TZE/6vdYB55j
 k7C7huQ582cEiGWQJ0U4/+En0hF85zkAk5mTfnSao5BqxLr9ANsAocwBUNBXdFSp
 B7IyMo4Dct7NCkwfmKLPRcEqZ49vwyv99TqM/9wUkgUStkTjPT7bhHgarB6VPbhp
 BxoXVnFYgU0sZN0y71IBt8ngWqCK6j7fjw3gsl37oEenG3/h3SO0H9ih1FrysX8S
 VOwwLJq6vfAgEwQvZACnBwWKDYsZpH7akNp9WGeDMByo28t514RNRjIi0mvLHEZa
 h72I8Sb3bwHO9MJNvHFe/0b1Say4vw==
 =dxAX
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Only miscellaneous cleanups and bug fixes for ext4 this cycle"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: create ext4_kset dynamically
  ext4: create ext4_feat kobject dynamically
  ext4: release kobject/kset even when init/register fail
  ext4: fix incorrect indentation of if statement
  ext4: correct documentation for grpid mount option
  ext4: use 'sbi' instead of 'EXT4_SB(sb)'
  ext4: save error to disk in __ext4_grp_locked_error()
  jbd2: fix sphinx kernel-doc build warnings
  ext4: fix a race in the ext4 shutdown path
  mbcache: make sure c_entry_count is not decremented past zero
  ext4: no need flush workqueue before destroying it
  ext4: fixed alignment and minor code cleanup in ext4.h
  ext4: fix ENOSPC handling in DAX page fault handler
  dax: pass detailed error code from dax_iomap_fault()
  mbcache: revert "fs/mbcache.c: make count_objects() more robust"
  mbcache: initialize entry->e_referenced in mb_cache_entry_create()
  ext4: fix up remaining files with SPDX cleanups
2018-02-03 13:49:22 -08:00
Linus Torvalds
a4b7fd7d34 inode->i_version rework for v4.16
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJabwjlAAoJEAAOaEEZVoIVeEEP/R84kZJjlZV/vNmFFvY46jM+
 0hpMHXRNym+nW1Du1CKNkesEUAY8ACAQIyzJh63Q72341QTDdz3+asHwPYRNOqdC
 PgryidPieojkNKQg+h7dmoKYlYh1xiCicvn66Q5PFb9B0lH36twekOK4X1qqJj8Z
 breRmRoFLka9looMSuYgwbErts023fmASalvGum6T0ZM/7F9hUj4O3OsQtKTLUNM
 VQ+gLJTQrUqrgzvWUwq3WTMa9YAaKP4oad8nsglNSpiVLG7WtURr5HokW9hAziqL
 k99Y+K2ni1wZJlNGJAyV7PyEG2ieI5Xn+LzM2RM+SndD1QHF2QXACmSTDYfL51k5
 G2RsKeTZvQPtX4qx9+vnCp/4oV6JduvCaq2Mt8SQb9nYZxKjs85TNLrARJv+85eQ
 zP0OTxlH1Gfu3j36n3cny4XemyMYYF4hCFYfRPqTGst37fgLBtfIfUSQ6jedoCK2
 Xcyb6ukGXMh6If/A7DSy91hvSSPrWSH7TPPsbfLy6o+wUOtpAGR4eXVlEuAiXrzc
 gnoAz85oIMUQae66LrdrPk1NyE59qOb24g/yU5gyRBSpi2+/aoboNCKaD73tgs/C
 XIMwGXLYmqkcud7IBQF0tHHiM+jsEkbSM4LUqRXSnqMdwNnS18Z4Q+JKqpdP0cii
 eRdenDvUfu8Gu1Y9vWBv
 =iihN
 -----END PGP SIGNATURE-----

Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull inode->i_version rework from Jeff Layton:
 "This pile of patches is a rework of the inode->i_version field. We
  have traditionally incremented that field on every inode data or
  metadata change. Typically this increment needs to be logged on disk
  even when nothing else has changed, which is rather expensive.

  It turns out though that none of the consumers of that field actually
  require this behavior. The only real requirement for all of them is
  that it be different iff the inode has changed since the last time the
  field was checked.

  Given that, we can optimize away most of the i_version increments and
  avoid dirtying inode metadata when the only change is to the i_version
  and no one is querying it. Queries of the i_version field are rather
  rare, so we can help write performance under many common workloads.

  This patch series converts existing accesses of the i_version field to
  a new API, and then converts all of the in-kernel filesystems to use
  it. The last patch in the series then converts the backend
  implementation to a scheme that optimizes away a large portion of the
  metadata updates when no one is looking at it.

  In my own testing this series significantly helps performance with
  small I/O sizes. I also got this email for Christmas this year from
  the kernel test robot (a 244% r/w bandwidth improvement with XFS over
  DAX, with 4k writes):

    https://lkml.org/lkml/2017/12/25/8

  A few of the earlier patches in this pile are also flowing to you via
  other trees (mm, integrity, and nfsd trees in particular)".

* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
  fs: handle inode->i_version more efficiently
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  fs: only set S_VERSION when updating times if necessary
  IMA: switch IMA over to new i_version API
  xfs: convert to new i_version API
  ufs: use new i_version API
  ocfs2: convert to new i_version API
  nfsd: convert to new i_version API
  nfs: convert to new i_version API
  ext4: convert to new i_version API
  ext2: convert to new i_version API
  exofs: switch to new i_version API
  btrfs: convert to new i_version API
  afs: convert to new i_version API
  affs: convert to new i_version API
  fat: convert to new i_version API
  fs: don't take the i_lock in inode_inc_iversion
  fs: new API for handling inode->i_version
  ntfs: remove i_version handling
  ...
2018-01-29 13:33:53 -08:00
Jeff Layton
ee73f9a52a ext4: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
2018-01-29 06:42:21 -05:00
Jeff Layton
ae5e165d85 fs: new API for handling inode->i_version
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:41:30 -05:00
Harshad Shirwadkar
abbc3f9395 ext4: fix a race in the ext4 shutdown path
This patch fixes a race between the shutdown path and bio completion
handling. In the ext4 direct io path with async io, after submitting a
bio to the block layer, if journal starting fails,
ext4_direct_IO_write() would bail out pretending that the IO
failed. The caller would have had no way of knowing whether or not the
IO was successfully submitted. So instead, we return -EIOCBQUEUED in
this case. Now, the caller knows that the IO was submitted.  The bio
completion handler takes care of the error.

Tested: Ran the shutdown xfstest test 461 in loop for over 2 hours across
4 machines resulting in over 400 runs. Verified that the race didn't
occur. Usually the race was seen in about 20-30 iterations.

Signed-off-by: Harshad Shirwadkar <harshads@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-01-10 00:13:13 -05:00
Andi Kleen
fc82228a5e ext4: support fast symlinks from ext3 file systems
407cd7fb83 (ext4: change fast symlink test to not rely on i_blocks)
broke ~10 years old ext3 file systems created by 2.6.17. Any ELF
executable fails because the /lib/ld-linux.so.2 fast symlink
cannot be read anymore.

The patch assumed fast symlinks were created in a specific way,
but that's not true on these really old file systems.

The new behavior is apparently needed only with the large EA inode
feature.

Revert to the old behavior if the large EA inode feature is not set.

This makes my old VM boot again.

Fixes: 407cd7fb83 (ext4: change fast symlink test to not rely on i_blocks)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2017-12-03 20:38:01 -05:00
Linus Torvalds
1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Linus Torvalds
a3841f94c7 libnvdimm for 4.15
* Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
  'userspace flush' of persistent memory updates via filesystem-dax
   mappings. It arranges for any filesystem metadata updates that may be
   required to satisfy a write fault to also be flushed ("on disk") before
   the kernel returns to userspace from the fault handler. Effectively
   every write-fault that dirties metadata completes an fsync() before
   returning from the fault handler. The new MAP_SHARED_VALIDATE mapping
   type guarantees that the MAP_SYNC flag is validated as supported by the
   filesystem's ->mmap() file operation.
 
 * Add support for the standard ACPI 6.2 label access methods that
   replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This
   enables interoperability with environments that only implement the
   standardized methods.
 
 * Add support for the ACPI 6.2 NVDIMM media error injection methods.
 
 * Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch
   last shutdown status, firmware update, SMART error injection, and
   SMART alarm threshold control.
 
 * Cleanup physical address information disclosures to be root-only.
 
 * Fix revalidation of the DIMM "locked label area" status to support
   dynamic unlock of the label area.
 
 * Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
   (system-physical-address) command and error injection commands.
 
 Acknowledgements that came after the commits were pushed to -next:
 
 957ac8c421 dax: fix PMD faults on zero-length files
 Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
 
 a39e596baa xfs: support for synchronous DAX faults
 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
 
 7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaDfvcAAoJEB7SkWpmfYgCk7sP/2qJhBH+VTTdg2osDnhAdAhI
 co/AGEmsHFlUCMBb/Ek7UnMAmhBYiJU2q4ywPsNFBpusXpMlqNy5Iwo7k4/wQHE/
 SJcIM0g4zg0ViFuUhwV+C2T0R5UzFR8JLd9EYWj/YS6aJpurtotm5l4UStaM0Hzo
 AhxSXJLrBDuqCpbOxbctfiGEmdRL7aRfBEAARTNRKBn/iXxJUcYHlp62rtXQS+t4
 I6LC/URCWTNTTMGmzW6TRsgSD9WMfd19xKcGzN3qL6ee0KFccxN4ctFqHA/sFGOh
 iYLeR0XJUjJxyp+PkWGteXPVZL0Kj3bD/lSTG+Co5bm/ra8a/sh3TSFfgFyoBZD1
 EqMN8Ryf80hGp3FabeH2Iw2SviYPZpHSWgjddjxLD0RA6OmpzINc+Wm8eqApjMME
 sbZDTOijiab4QMQ0XamF4GuDHyQtawv5Y/w2Ehhl1tmiqW+5tKhsKqxkQt+/V3Yt
 RTVSRe2Pkway66b+cD64IdQ6L2tyonPnmi5IzgkKOhlOEGomy+4/U2Jt2bMbhzq6
 ymszKmXp2XI8P06wU8sHrIUeXO5I9qoKn/fZA73Eb8aIzgJe3tBE/5+Ab7RG6HB9
 1OVfcMWoXU1gNgNktTs63X1Lsg4aW9kt/K4fPHHcqUcaliEJpJTlAbg9GLF2buoW
 nQ+0fTRgMRihE3ZA0Fs3
 =h2vZ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and dax updates from Dan Williams:
 "Save for a few late fixes, all of these commits have shipped in -next
  releases since before the merge window opened, and 0day has given a
  build success notification.

  The ext4 touches came from Jan, and the xfs touches have Darrick's
  reviewed-by. An xfstest for the MAP_SYNC feature has been through
  a few round of reviews and is on track to be merged.

   - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
     'userspace flush' of persistent memory updates via filesystem-dax
     mappings. It arranges for any filesystem metadata updates that may
     be required to satisfy a write fault to also be flushed ("on disk")
     before the kernel returns to userspace from the fault handler.
     Effectively every write-fault that dirties metadata completes an
     fsync() before returning from the fault handler. The new
     MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
     is validated as supported by the filesystem's ->mmap() file
     operation.

   - Add support for the standard ACPI 6.2 label access methods that
     replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
     This enables interoperability with environments that only implement
     the standardized methods.

   - Add support for the ACPI 6.2 NVDIMM media error injection methods.

   - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
     latch last shutdown status, firmware update, SMART error injection,
     and SMART alarm threshold control.

   - Cleanup physical address information disclosures to be root-only.

   - Fix revalidation of the DIMM "locked label area" status to support
     dynamic unlock of the label area.

   - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
     (system-physical-address) command and error injection commands.

  Acknowledgements that came after the commits were pushed to -next:

   - 957ac8c421 ("dax: fix PMD faults on zero-length files"):
       Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

   - a39e596baa ("xfs: support for synchronous DAX faults") and
     7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
        Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"

* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
  acpi, nfit: add 'Enable Latch System Shutdown Status' command support
  dax: fix general protection fault in dax_alloc_inode
  dax: fix PMD faults on zero-length files
  dax: stop requiring a live device for dax_flush()
  brd: remove dax support
  dax: quiet bdev_dax_supported()
  fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
  tools/testing/nvdimm: unit test clear-error commands
  acpi, nfit: validate commands against the device type
  tools/testing/nvdimm: stricter bounds checking for error injection commands
  xfs: support for synchronous DAX faults
  xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
  ext4: Support for synchronous DAX faults
  ext4: Simplify error handling in ext4_dax_huge_fault()
  dax: Implement dax_finish_sync_fault()
  dax, iomap: Add support for synchronous faults
  mm: Define MAP_SYNC and VM_SYNC flags
  dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
  dax: Allow dax_iomap_fault() to return pfn
  dax: Fix comment describing dax_iomap_fault()
  ...
2017-11-17 09:51:57 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
dc7f3e868a ext4: use pagevec_lookup_range_tag()
We want only pages from given range in ext4_writepages().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Linus Torvalds
ae9a8c4bdc Add support for online resizing of file systems with bigalloc. Fix a
two data corruption bugs involving DAX, as well as a corruption bug
 after a crash during a racing fallocate and delayed allocation.
 Finally, a number of cleanups and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAloJCiEACgkQ8vlZVpUN
 gaOahAgAhcgdPagn/B5w+6vKFdH+hOJLKyGI0adGDyWD9YBXN0wFQvliVgXrTKei
 hxW2GdQGc6yHv9mOjvD+4Fn2AnTZk8F3GtG6zdqRM08JGF/IN2Jax2boczG/XnUz
 rT9cd3ic2Ff0KaUX+Yos55QwomTh5CAeRPgvB69o9D6L4VJzTlsWKSOBR19FmrSG
 NDmzZibgWmHcqzW9Bq8ZrXXx+KB42kUlc8tYYm2n6MTaE0LMvp3d9XcFcnm/I7Bk
 MGa2d3/3FArGD6Rkl/E82MXMSElOHJnY6jGYSDaadUeMI5FXkA6tECOSJYXqShdb
 ZJwkOBwfv2lbYZJxIBJTy/iA6zdsoQ==
 =ZzaJ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - Add support for online resizing of file systems with bigalloc

 - Fix a two data corruption bugs involving DAX, as well as a corruption
   bug after a crash during a racing fallocate and delayed allocation.

 - Finally, a number of cleanups and optimizations.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: improve smp scalability for inode generation
  ext4: add support for online resizing with bigalloc
  ext4: mention noload when recovering on read-only device
  Documentation: fix little inconsistencies
  ext4: convert timers to use timer_setup()
  jbd2: convert timers to use timer_setup()
  ext4: remove duplicate extended attributes defs
  ext4: add ext4_should_use_dax()
  ext4: add sanity check for encryption + DAX
  ext4: prevent data corruption with journaling + DAX
  ext4: prevent data corruption with inline data + DAX
  ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  ext4: retry allocations conservatively
  ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
  ext4: Add iomap support for inline data
  iomap: Add IOMAP_F_DATA_INLINE flag
  iomap: Switch from blkno to disk offset
2017-11-14 12:59:42 -08:00
Linus Torvalds
32190f0afb fscrypt: lots of cleanups, mostly courtesy by Eric Biggers
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAloI8AUACgkQ8vlZVpUN
 gaMdjgf8CCW7UhPjoZYwF8sUNtAaX9+JZT1maOcXUhpJ3vRQiRn+AzRH6yBYMm79
 +NZBwVlk4dlEe55Wh4yFIStMAstqzCrke4C9CSbExjgHNsJdU4znyYuLRMbLfyO0
 6c4NObiAIKJdW1/te1aN90keGC6min8pBZot+FqZsRr+Kq2+IOtM43JAv7efOLev
 v3LCjUf9JKxatoB8tgw4AJRa1p18p7D2APWTG05VlFq63TjhVIYNvvwcQlizLwGY
 cuEq3X59FbFdX06fJnucujU3WP3ES4/3rhufBK4NNaec5e5dbnH2KlAx7J5SyMIZ
 0qUFB/dmXDSb3gsfScSGo1F71Ad0CA==
 =asAm
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Lots of cleanups, mostly courtesy by Eric Biggers"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: lock mutex before checking for bounce page pool
  fscrypt: add a documentation file for filesystem-level encryption
  ext4: switch to fscrypt_prepare_setattr()
  ext4: switch to fscrypt_prepare_lookup()
  ext4: switch to fscrypt_prepare_rename()
  ext4: switch to fscrypt_prepare_link()
  ext4: switch to fscrypt_file_open()
  fscrypt: new helper function - fscrypt_prepare_setattr()
  fscrypt: new helper function - fscrypt_prepare_lookup()
  fscrypt: new helper function - fscrypt_prepare_rename()
  fscrypt: new helper function - fscrypt_prepare_link()
  fscrypt: new helper function - fscrypt_file_open()
  fscrypt: new helper function - fscrypt_require_key()
  fscrypt: remove unneeded empty fscrypt_operations structs
  fscrypt: remove ->is_encrypted()
  fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
  fs, fscrypt: add an S_ENCRYPTED inode flag
  fscrypt: clean up include file mess
2017-11-14 11:35:15 -08:00
Dan Williams
aaa422c4c3 fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
While reviewing whether MAP_SYNC should strengthen its current guarantee
of syncing writes from the initiating process to also include
third-party readers observing dirty metadata, Dave pointed out that the
check of IOMAP_WRITE is misplaced.

The policy of what to with IOMAP_F_DIRTY should be separated from the
generic filesystem mechanism of reporting dirty metadata. Move this
policy to the fs-dax core to simplify the per-filesystem iomap handlers,
and further centralize code that implements the MAP_SYNC policy. This
otherwise should not change behavior, it just makes it easier to change
behavior in the future.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-13 16:38:44 -08:00
Jan Kara
b8a6176c21 ext4: Support for synchronous DAX faults
We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
prepare blocks for writing and the inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case
(through VM_FAULT_NEEDDSYNC return value) and call helper
dax_finish_sync_fault() to flush metadata changes and insert page table
entry. Note that this will also dirty corresponding radix tree entry
which is what we want - fsync(2) will still provide data integrity
guarantees for applications not using userspace flushing. And
applications using userspace flushing can avoid calling fsync(2) and
thus avoid the performance overhead.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:26 -07:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Eric Biggers
3ce2b8ddd8 ext4: switch to fscrypt_prepare_setattr()
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 20:21:58 -04:00
Eric Biggers
2ee6a576be fs, fscrypt: add an S_ENCRYPTED inode flag
Introduce a flag S_ENCRYPTED which can be set in ->i_flags to indicate
that the inode is encrypted using the fscrypt (fs/crypto/) mechanism.

Checking this flag will give the same information that
inode->i_sb->s_cop->is_encrypted(inode) currently does, but will be more
efficient.  This will be useful for adding higher-level helper functions
for filesystems to use.  For example we'll be able to replace this:

	if (ext4_encrypted_inode(inode)) {
		ret = fscrypt_get_encryption_info(inode);
		if (ret)
			return ret;
		if (!fscrypt_has_encryption_key(inode))
			return -ENOKEY;
	}

with this:

	ret = fscrypt_require_key(inode);
	if (ret)
		return ret;

... since we'll be able to retain the fast path for unencrypted files as
a single flag check, using an inline function.  This wasn't possible
before because we'd have had to frequently call through the
->i_sb->s_cop->is_encrypted function pointer, even when the encryption
support was disabled or not being used.

Note: we don't define S_ENCRYPTED to 0 if CONFIG_FS_ENCRYPTION is
disabled because we want to continue to return an error if an encrypted
file is accessed without encryption support, rather than pretending that
it is unencrypted.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-18 19:52:36 -04:00
Ross Zwisler
6642586b3e ext4: add ext4_should_use_dax()
This helper, in the spirit of ext4_should_dioread_nolock() et al., replaces
the complex conditional in ext4_set_inode_flags().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-10-12 12:00:59 -04:00
Ross Zwisler
e9072d859d ext4: prevent data corruption with journaling + DAX
The current code has the potential for data corruption when changing an
inode's journaling mode, as that can result in a subsequent unsafe change
in S_DAX.

I've captured an instance of this data corruption in the following fstest:

https://patchwork.kernel.org/patch/9948377/

Prevent this data corruption from happening by disallowing changes to the
journaling mode if the '-o dax' mount option was used.  This means that for
a given filesystem we could have a mix of inodes using either DAX or
data journaling, but whatever state the inodes are in will be held for the
duration of the mount.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2017-10-12 11:54:08 -04:00
Christoph Hellwig
545052e9e3 ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
Switch to the iomap_seek_hole and iomap_seek_data helpers for
implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the code that
isn't needed any more.

Note that with this patch ext4 will now always depend on the iomap code
instead of only when CONFIG_DAX is enabled, and it requires adding a
call into the extent status tree for iomap_begin as well to properly
deal with delalloc extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
[More fixes and cleanups by Andreas]
2017-10-01 17:58:54 -04:00
Andreas Gruenbacher
7046ae3532 ext4: Add iomap support for inline data
Report inline data as a IOMAP_F_DATA_INLINE mapping.  This allows to use
iomap_seek_hole and iomap_seek_data in ext4_llseek and makes switching
to iomap_fiemap in ext4_fiemap easier.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-10-01 17:57:54 -04:00
Andreas Gruenbacher
19fe5f643f iomap: Switch from blkno to disk offset
Replace iomap->blkno, the sector number, with iomap->addr, the disk
offset in bytes.  For invalid disk offsets, use the special value
IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK.

This allows to use iomap for mappings which are not block aligned, such
as inline data on ext4.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>  # iomap, xfs
Reviewed-by: Jan Kara <jack@suse.cz>
2017-10-01 17:55:54 -04:00
Linus Torvalds
89fd915c40 libnvdimm for 4.14
* Media error handling support in the Block Translation Table (BTT)
   driver is reworked to address sleeping-while-atomic locking and
   memory-allocation-context conflicts.
 
 * The dax_device lookup overhead for xfs and ext4 is moved out of the
   iomap hot-path to a mount-time lookup.
 
 * A new 'ecc_unit_size' sysfs attribute is added to advertise the
   read-modify-write boundary property of a persistent memory range.
 
 * Preparatory fix-ups for arm and powerpc pmem support are included
   along with other miscellaneous fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
 ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
 sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
 BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
 ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
 uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
 P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
 IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
 aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
 pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
 +DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
 K4hXLV3fVBVRMiS0RA6z
 =5iGY
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm from Dan Williams:
 "A rework of media error handling in the BTT driver and other updates.
  It has appeared in a few -next releases and collected some late-
  breaking build-error and warning fixups as a result.

  Summary:

   - Media error handling support in the Block Translation Table (BTT)
     driver is reworked to address sleeping-while-atomic locking and
     memory-allocation-context conflicts.

   - The dax_device lookup overhead for xfs and ext4 is moved out of the
     iomap hot-path to a mount-time lookup.

   - A new 'ecc_unit_size' sysfs attribute is added to advertise the
     read-modify-write boundary property of a persistent memory range.

   - Preparatory fix-ups for arm and powerpc pmem support are included
     along with other miscellaneous fixes"

* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
  libnvdimm, btt: fix format string warnings
  libnvdimm, btt: clean up warning and error messages
  ext4: fix null pointer dereference on sbi
  libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
  dax: fix FS_DAX=n BLOCK=y compilation
  libnvdimm: fix integer overflow static analysis warning
  libnvdimm, nd_blk: remove mmio_flush_range()
  libnvdimm, btt: rework error clearing
  libnvdimm: fix potential deadlock while clearing errors
  libnvdimm, btt: cache sector_size in arena_info
  libnvdimm, btt: ensure that flags were also unchanged during a map_read
  libnvdimm, btt: refactor map entry operations with macros
  libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
  libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
  ext4: perform dax_device lookup at mount
  ext2: perform dax_device lookup at mount
  xfs: perform dax_device lookup at mount
  dax: introduce a fs_dax_get_by_bdev() helper
  libnvdimm, btt: check memory allocation failure
  libnvdimm, label: fix index block size calculation
  ...
2017-09-11 13:10:57 -07:00
Linus Torvalds
d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Jan Kara
397162ffa2 mm: remove nr_pages argument from pagevec_lookup{,_range}()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
2b85a6171d ext4: use pagevec_lookup_range() in writeback code
Both occurences of pagevec_lookup() actually want only pages from a
given range.  Use pagevec_lookup_range() for the lookup.

Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Dan Williams
5e405595e5 ext4: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 11:12:13 -07:00
Tahsin Erdogan
a6d0567604 ext4: backward compatibility support for Lustre ea_inode implementation
Original Lustre ea_inode feature did not have ref counts on xattr inodes
because there was always one parent that referenced it. New
implementation expects ref count to be initialized which is not true for
Lustre case. Handle this by detecting Lustre created xattr inode and set
its ref count to 1.

The quota handling of xattr inodes have also changed with deduplication
support. New implementation manually manages quotas to support sharing
across multiple users. A consequence is that, a referencing inode
incorporates the blocks of xattr inode into its own i_block field.

We need to know how a xattr inode was created so that we can reverse the
block charges during reference removal. This is handled by introducing a
EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if
inode appears to have been created by Lustre. During xattr inode reference
removal, the manual quota uncharge is skipped if the flag is set.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24 14:25:02 -04:00
Miao Xie
c03b45b853 ext4, project: expand inode extra size if possible
When upgrading from old format, try to set project id
to old file first time, it will return EOVERFLOW, but if
that file is dirtied(touch etc), changing project id will
be allowed, this might be confusing for users, we could
try to expand @i_extra_isize here too.

Reported-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 01:00:49 -04:00
Miao Xie
cf0a5e818f ext4: restructure ext4_expand_extra_isize
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.

Besides that, we clean up unnecessary check and move some relative checks
into it.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:40:01 -04:00
Miao Xie
3b10fdc6d8 ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:27:38 -04:00
Tahsin Erdogan
9699d4f91d ext4: make xattr inode reads faster
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.

Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 00:07:01 -04:00
Eric Whitney
1e21196c8e ext4: correct comment references to ext4_ext_direct_IO()
Commit 914f82a32d "ext4: refactor direct IO code" deleted
ext4_ext_direct_IO(), but references to that function remain in
comments.  Update them to refer to ext4_direct_IO_write().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:26:40 -04:00
Tahsin Erdogan
407cd7fb83 ext4: change fast symlink test to not rely on i_blocks
ext4_inode_info->i_data is the storage area for 4 types of data:

  a) Extents data
  b) Inline data
  c) Block map
  d) Fast symlink data (symlink length < 60)

Extents data case is positively identified by EXT4_INODE_EXTENTS flag.
Inline data case is also obvious because of EXT4_INODE_INLINE_DATA
flag.

Distinguishing c) and d) however requires additional logic. This
currently relies on i_blocks count. After subtracting external xattr
block from i_blocks, if it is greater than 0 then we know that some
data blocks exist, so there must be a block map.

This logic got broken after ea_inode feature was added. That feature
charges the data blocks of external xattr inodes to the referencing
inode and so adds them to the i_blocks. To fix this, we could subtract
ea_inode blocks by iterating through all xattr entries and then check
whether remaining i_blocks count is zero. Besides being complicated,
this won't change the fact that the current way of distinguishing
between c) and d) is fragile.

The alternative solution is to test whether i_size is less than 60 to
determine fast symlink case. ext4_symlink() uses the same test to decide
whether to store the symlink in i_data. There is one caveat to address
before this can work though.

If an inode's i_nlink is zero during eviction, its i_size is set to
zero and its data is truncated. If system crashes before inode is removed
from the orphan list, next boot orphan cleanup may find the inode with
zero i_size. So, a symlink that had its data stored in a block may now
appear to be a fast symlink. The solution used in this patch is to treat
i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal
so the only time this can happen is the mentioned scenario. This is also
logically correct because a i_size = 0 symlink has no data stored in
i_data.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-07-04 00:11:21 -04:00
Eric Biggers
63136858ae ext4: require key for truncate(2) of encrypted file
Currently, filesystems allow truncate(2) on an encrypted file without
the encryption key.  However, it's impossible to correctly handle the
case where the size being truncated to is not a multiple of the
filesystem block size, because that would require decrypting the final
block, zeroing the part beyond i_size, then encrypting the block.

As other modifications to encrypted file contents are prohibited without
the key, just prohibit truncate(2) as well, making it fail with ENOKEY.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-23 19:48:44 -04:00
Jan Kara
3abb1a0fc2 ext4: avoid unnecessary stalls in ext4_evict_inode()
These days inode reclaim calls evict_inode() only when it has no pages
in the mapping.  In that case it is not necessary to wait for transaction
commit in ext4_evict_inode() as there can be no pages waiting to be
committed.  So avoid unnecessary transaction waiting in that case.

We still have to keep the check for the case where ext4_evict_inode()
gets called from other paths (e.g. umount) where inode still can have
some page cache pages.

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 23:49:46 -04:00
Tahsin Erdogan
7a9ca53aea quota: add get_inode_usage callback to transfer multi-inode charges
Ext4 ea_inode feature allows storing xattr values in external inodes to
be able to store values that are bigger than a block in size. Ext4 also
has deduplication support for these type of inodes. With deduplication,
the actual storage waste is eliminated but the users of such inodes are
still charged full quota for the inodes as if there was no sharing
happening in the background.

This design requires ext4 to manually charge the users because the
inodes are shared.

An implication of this is that, if someone calls chown on a file that
has such references we need to transfer the quota for the file and xattr
inodes. Current dquot_transfer() function implicitly transfers one inode
charge. With ea_inode feature, we would like to transfer multiple inode
charges.

Add get_inode_usage callback which can interrogate the total number of
inodes that were charged for a given inode.

[ Applied fix from Colin King to make sure the 'ret' variable is
  initialized on the successful return path.  Detected by
  CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
2017-06-22 11:46:48 -04:00
Tahsin Erdogan
dec214d00e ext4: xattr inode deduplication
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:44:55 -04:00
Tahsin Erdogan
30a7eb970c ext4: cleanup transaction restarts during inode deletion
During inode deletion, the number of journal credits that will be
needed is hard to determine.  For that reason we have journal
extend/restart calls in several places.  Whenever a transaction is
restarted, filesystem must be in a consistent state because there is
no atomicity guarantee beyond a restart call.

Add ext4_xattr_ensure_credits() helper function which takes care of
journal extend/restart logic.  It also handles getting jbd2 write
access and dirty metadata calls.  This function is called at every
iteration of handling an ea_inode reference.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:42:09 -04:00
Tahsin Erdogan
02749a4c20 ext4: add ext4_is_quota_file()
IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
also uses it to check whether an inode is for a quota file. The
distinction currently doesn't matter because quota is disabled only
for the quota files. When we start disabling quota for other inodes
in the future, we will want to make the distinction clear.

Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
we are checking for quota files.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:31:25 -04:00
Tahsin Erdogan
0421a189bc ext4: modify ext4_xattr_ino_array to hold struct inode *
Tracking struct inode * rather than the inode number eliminates the
repeated ext4_xattr_inode_iget() call later. The second call cannot
fail in practice but still requires explanation when it wants to ignore
the return value. Avoid the trouble and make things simple.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 10:26:31 -04:00
Tahsin Erdogan
33d201e027 ext4: fix lockdep warning about recursive inode locking
Setting a large xattr value may require writing the attribute contents
to an external inode. In this case we may need to lock the xattr inode
along with the parent inode. This doesn't pose a deadlock risk because
xattr inodes are not directly visible to the user and their access is
restricted.

Assign a lockdep subclass to xattr inode's lock.

 ============================================
 WARNING: possible recursive locking detected
 4.12.0-rc1+ #740 Not tainted
 --------------------------------------------
 python/1822 is trying to acquire lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0

 but task is already holding lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sb->s_type->i_mutex_key#15);
   lock(&sb->s_type->i_mutex_key#15);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by python/1822:
  #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
  #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
  #2:  (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
  #3:  (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0

 stack backtrace:
 CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 Call Trace:
  dump_stack+0x67/0x9e
  __lock_acquire+0x5f3/0x1750
  lock_acquire+0xb5/0x1d0
  down_write+0x2c/0x60
  ext4_xattr_set_entry+0x65a/0x7b0
  ext4_xattr_block_set+0x1b2/0x9b0
  ext4_xattr_set_handle+0x322/0x4f0
  ext4_xattr_set+0x144/0x1a0
  ext4_xattr_user_set+0x34/0x40
  __vfs_setxattr+0x66/0x80
  __vfs_setxattr_noperm+0x69/0x1c0
  vfs_setxattr+0xa2/0xb0
  setxattr+0x12e/0x150
  path_setxattr+0x87/0xb0
  SyS_setxattr+0xf/0x20
  entry_SYSCALL_64_fastpath+0x18/0xad

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-21 21:17:10 -04:00
Andreas Dilger
e50e5129f3 ext4: xattr-in-inode support
Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
        __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
        __le32  e_value_inum;   /* inode in which value is stored */
        __le32  e_value_size;   /* size of attribute value */
        __le32  e_hash;         /* hash value of name and value */
        char    e_name[0];      /* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2017-06-21 21:10:32 -04:00
Artem Blagodarenko
e08ac99fa2 ext4: add largedir feature
This INCOMPAT_LARGEDIR feature allows larger directories to be created
in ldiskfs, both with directory sizes over 2GB and and a maximum htree
depth of 3 instead of the current limit of 2. These features are needed
in order to exceed the current limit of approximately 10M entries in a
single directory.

This patch was originally written by Yang Sheng to support the Lustre server.

[ Bumped the credits needed to update an indexed directory -- tytso ]

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Yang Sheng <yang.sheng@intel.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
2017-06-21 21:09:57 -04:00
Jan Kara
67a7d5f561 ext4: fix fdatasync(2) after extent manipulation operations
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-29 13:24:55 -04:00
Jan Kara
a056bdaae7 ext4: fix data corruption for mmap writes
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.

Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:45:45 -04:00