Commit Graph

61792 Commits

Author SHA1 Message Date
David Sterba
b3470b5dbe btrfs: add dedicated members for start and length of a block group
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.

Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key.  When the item is searched or written, a
local variable for key is used.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
0222dfdd4a btrfs: rename extent buffer block group item accessors
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
de0dc456fd btrfs: rename block_group_item on-stack accessors to follow naming
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
3d976388da btrfs: remove embedded block_group_cache::item
The members ::used and ::flags are now in the block group cache
structure, the last one is chunk_objectid, but that's set to a fixed
value and otherwise unused. The item is constructed from a local
variable before write, so we can remove the embedded one from block
group.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
f93c63e547 btrfs: move block_group_item::flags to block group
The flags are read from the item that's embedded to block group struct,
but the item will be removed. Use the ::flags after read and before
write.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
bf38be65f3 btrfs: move block_group_item::used to block group
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.

The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
Qu Wenruo
34b127aecd btrfs: Remove btrfs_bio::flags member
The last user of btrfs_bio::flags was removed in commit 326e1dbb57
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.

(Tagged for stable as the structure is heavily used and space savings
are desirable.)

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
352ae07b59 btrfs: add blake2b to checksumming algorithms
Add blake2b (with 256 bit digest) to the list of possible checksumming
algorithms used by BTRFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
b4e967be43 btrfs: add member for a specific checksum driver
Currently all the checksum algorithms generate a fixed size digest size
and we use it.  The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.

Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.

The only place that uses the driver name is the crypto API setup.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
Johannes Thumshirn
41e6d2a808 btrfs: sysfs: show used checksum driver per filesystem
Show the used driver for the checksum algorithm for the filesystem in
sysfs file /sys/fs/btrfs/UUID/features/checksum, eg.

  crc32c (crc32c-generic)

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
David Sterba
f7cea56c0f btrfs: sysfs: export supported checksums
Export supported checksum algorithms via sysfs in the list of static
features:

  /sys/fs/btrfs/features/supported_checksums

Space spearated list of checksum algorithm names.

Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
Johannes Thumshirn
3831bf0094 btrfs: add sha256 to checksumming algorithm
Add sha256 to the list of possible checksumming algorithms used by BTRFS.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
Johannes Thumshirn
3951e7f050 btrfs: add xxhash64 to checksumming algorithms
Add xxhash64 to the list of possible checksumming algorithms used by
BTRFS.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
Darrick J. Wong
6519f708cc xfs: report corruption only as a regular error
Redefine XFS_IS_CORRUPT so that it reports corruptions only via
xfs_corruption_report.  Since these are on-disk contents (and not checks
of internal state), we don't ever want to panic the kernel.  This also
amends the corruption report to recommend unmounting and running
xfs_repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-18 08:40:44 -08:00
Carlos Maiolino
377bcd5f3b xfs: Remove kmem_zone_free() wrapper
We can remove it now, without needing to rework the KM_ flags.

Use kmem_cache_free() directly.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-18 08:40:44 -08:00
Carlos Maiolino
aaf54eb8bc xfs: Remove kmem_zone_destroy() wrapper
Use kmem_cache_destroy directly

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-18 08:40:44 -08:00
Carlos Maiolino
b1231760e4 xfs: Remove slab init wrappers
Remove kmem_zone_init() and kmem_zone_init_flags() together with their
specific KM_* to SLAB_* flag wrappers.

Use kmem_cache_create() directly.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-18 08:40:43 -08:00
Ingo Molnar
b21feab0b8 Linux 5.4-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3RzgkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN18H/0JZbfIpy8/4Irol
 0va7Aj2fBi1a5oxfqYsMKN0u3GKbN3OV9tQ+7w1eBNGvL72TGadgVTzTY+Im7A9U
 UjboAc7jDPCG+YhIwXFufMiIAq5jDIj6h0LDas7ALsMfsnI/RhTwgNtLTAkyI3dH
 YV/6ljFULwueJHCxzmrYbd1x39PScj3kCNL2pOe6On7rXMKOemY/nbbYYISxY30E
 GMgKApSS+li7VuSqgrKoq5Qaox26LyR2wrXB1ij4pqEJ9xgbnKRLdHuvXZnE+/5p
 46EMirt+yeSkltW3d2/9MoCHaA76ESzWMMDijLx7tPgoTc3RB3/3ZLsm3rYVH+cR
 cRlNNSk=
 =0+Cg
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:41:02 +01:00
Navid Emamdoost
450c3d4166 affs: fix a memory leak in affs_remount
In affs_remount if data is provided it is duplicated into new_opts.  The
allocated memory for new_opts is only released if parse_options fails.

There's a bit of history behind new_options, originally there was
save/replace options on the VFS layer so the 'data' passed must not
change (thus strdup), this got cleaned up in later patches. But not
completely.

There's no reason to do the strdup in cases where the filesystem does
not need to reuse the 'data' again, because strsep would modify it
directly.

Fixes: c8f33d0bec ("affs: kstrdup() memory handling")
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 14:26:43 +01:00
Davidlohr Bueso
6060c07d47 affs: Replace binary semaphores with mutexes
At a slight footprint cost (24 vs 32 bytes), mutexes are more optimal
than semaphores; it's also a nicer interface for mutual exclusion,
which is why they are encouraged over binary semaphores, when possible.

For both i_link_lock and i_ext_lock (and hence i_hash_lock which I
annotated for the hash lock mapping hackery for lockdep), their semantics
imply traditional lock ownership; that is, the lock owner is the same for
both lock/unlock operations and does not run in irq context. Therefore
it is safe to convert.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 14:26:43 +01:00
David Sterba
8530c37a70 btrfs: get bdev from latest_dev for dio bh_result
To remove use of extent_map::bdev we need to find a replacement, and the
latest_bdev is the only one we can use here, because inode::i_bdev and
superblock::s_bdev are NULL.

The DIO code uses bdev in two places:

* to read blocksize to perform alignment checks in
  do_blockdev_direct_IO, but we do them in btrfs code before any call to
  DIO

* in the following call chain:

  do_direct_IO
    get_more_blocks
     sdio->get_block() <-- this is btrfs_get_blocks_direct

  subsequently the map_bh->b_dev member is used in clean_bdev_aliases
  and dio_new_bio to set the bio's bdev to that of the buffer_head.
  However, because we have provided a submit function dio_bio_submit
  calls our submission function and ignores the bdev.

So it's safe to pass any valid bdev that's used within the filesystem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
David Sterba
c3e14909d3 btrfs: assert extent_map bdevs and lookup_map and split
This is a preparatory patch for removing extent_map::bdev. There's some
history behind the code so this is only precaution to catch if things
break before the actual removal happens.

Logically, comparing a raw low-level block device (bdev) does not make
sense for extent maps (high-level objects). This had no effect in
practice but was quite confusing in the code.  The lookup_map is set iff
EXTENT_FLAG_FS_MAPPING is set.

The two pointers were stored in the same bytes and used potentially in
two meanings. Now they're split, so the asserts are in place to check
that the condition will not change.

The lookup map pointer misused bdev, this has been changed in commit
95617d6932 ("btrfs: cleanup, stop casting for extent_map->lookup
everywhere") to the explicit type. But the semantics hasn't changed and
bdev was not actually used to decide if maps are mergeable.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
Johannes Thumshirn
32ab3d1b4d btrfs: remove pointless indentation in btrfs_read_sys_array()
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.

This removes a level of indentation and makes the code nicer to read.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
Johannes Thumshirn
5ae2169290 btrfs: reduce indentation in btrfs_may_alloc_data_chunk
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of
type BTRFS_BLOCK_GROUP_DATA and if it is we process it.

Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk
we can negate the check and bail out early if it isn't.

This makes the code a bit more readable.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Johannes Thumshirn
721860d578 btrfs: remove pointless local variable in lock_stripe_add()
In lock_stripe_add() we're caching the bucket for the stripe hash table
just for a single call to dereference the stripe hash.

If we just directly call rbio_bucket() we can safe the pointless local
variable.

Also move the dereferencing of the stripe hash outside of the variable
declaration block to not break over the 80 characters limit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Johannes Thumshirn
9d6cb1b0f9 btrfs: raid56: reduce indentation in lock_stripe_add
In lock_stripe_add() we're traversing the stripe hash list and check if
the current list element's raid_map equals is equal to the raid bio's
raid_map. If both are equal we continue processing.

If we'd check for inequality instead of equality we can reduce one level
of indentation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Nikolay Borisov
bc80230e0e btrfs: Return offset from find_desired_extent
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Nikolay Borisov
2034f3b470 btrfs: Simplify btrfs_file_llseek
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant.  Finally return directly the vale from vfs_setpos. No
semantic changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
Nikolay Borisov
d79b7c26b1 btrfs: Speed up btrfs_file_llseek
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.

Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.

The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.

This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:

    for (d=0; d<num_seek_read; d++)
      {
	/* offset %= 16777216; */
	fseek (f, 256 * d % 16777216, SEEK_SET);
	fread (buffer, 64, 1, f);
      }

Without patch:

num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

real	0m41.412s
user	0m28.777s
sys	2m16.510s

With patch:

num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

real	0m11.479s
user	0m27.629s
sys	0m21.040s

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
0cf2521313 btrfs: compression: remove ops pointer from workspace_manager
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
1e00235160 btrfs: compression: inline free_workspace
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
a3bbd2a9ee btrfs: compression: pass type to btrfs_put_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
c778df1406 btrfs: compression: inline alloc_workspace
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
5907a9bb13 btrfs: compression: pass type to btrfs_get_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
bd3a5287cc btrfs: compression: inline put_workspace
Similar to get_workspace, majority of the callbacks is trivial, we don't
gain anything by the indirection, so replace them by a switch function.
Trivial callback implementations use the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
6a0d12724b btrfs: compression: inline get_workspace
Majority of the callbacks is trivial, we don't gain anything by the
indirection, so replace them by a switch function.

ZLIB needs to adjust level in the callback and ZSTD workspace management
is complex, the rest is call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
d20f395f98 btrfs: compression: export alloc/free/get/put callbacks of all algos
The indirect calls will be replaced by a switch in compression.c.
(Switch is faster than indirect calls with when Spectre mitigations are
enabled).

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
2510307e6c btrfs: compression: inline cleanup_workspace_manager
Replace loop calling to all algos with a list of direct calls to the
cleanup manager callback. When that becomes trivial it is replaced by
direct call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
2dba714390 btrfs: compression: let workspace manager cleanup take only the type
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager cleanup helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
d551703347 btrfs: compression: inline init_workspace_manager
Replace loop calling to all algos with a list of direct calls to the
init manager callback. When that becomes trivial it is replaced by
direct call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
975db48330 btrfs: compression: let workspace manager init take only the type
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager init helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
be95104531 btrfs: compression: attach workspace manager to the ops
There's a lot of indirection when the generic code calls into
algo-specific callbacks to reach the private workspace manager structure
and back to the generic code.

To simplify that, export the workspace manager for heuristic, LZO and
ZLIB, while ZSTD is going to use it's own manager.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
1e4eb74654 btrfs: switch compression callbacks to direct calls
The indirect calls bring some overhead due to spectre vulnerability
mitigations. The number of cases is small and below the threshold
(10-20) where indirect call would be better.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba
c4bf665a31 btrfs: export compression and decompression callbacks
Export compress_pages, decompress_bio and decompress callbacks for all
compression algos. The indirect calls will be replaced by a switch.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Josef Bacik
a60adce85f btrfs: use btrfs_block_group_cache_done in update_block_group
When free'ing extents in a block group we check to see if the block
group is not cached, and then cache it if we need to.  However we'll
just carry on as long as we're loading the cache.  This is problematic
because we are dirtying the block group here.  If we are fast enough we
could do a transaction commit and clear the free space cache while we're
still loading the space cache in another thread.  This truncates the
free space inode, which will keep it from loading the space cache.

Fix this by using the btrfs_block_group_cache_done helper so that we try
to load the space cache unconditionally here, which will result in the
caller waiting for the fast caching to complete and keep us from
truncating the free space inode.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Josef Bacik
3797136b62 btrfs: check page->mapping when loading free space cache
While testing 5.2 we ran into the following panic

[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958]  try_to_free_buffers+0x15b/0x1b0
[52238.317503]  shrink_page_list+0x1164/0x1780
[52238.325877]  shrink_inactive_list+0x18f/0x3b0
[52238.334596]  shrink_node_memcg+0x23e/0x7d0
[52238.342790]  ? do_shrink_slab+0x4f/0x290
[52238.350648]  shrink_node+0xce/0x4a0
[52238.357628]  balance_pgdat+0x2c7/0x510
[52238.365135]  kswapd+0x216/0x3e0
[52238.371425]  ? wait_woken+0x80/0x80
[52238.378412]  ? balance_pgdat+0x510/0x510
[52238.386265]  kthread+0x111/0x130
[52238.392727]  ? kthread_create_on_node+0x60/0x60
[52238.401782]  ret_from_fork+0x1f/0x30

The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.

This is happening because we're truncating the free space cache while
we're trying to load the free space cache.  This isn't supposed to
happen, and I'll fix that in a followup patch.  However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us.  So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.

This page being unlocked as:
btrfs_readpage
  extent_read_full_page
    __extent_read_full_page
      __do_readpage
        if (!nr)
	   unlock_page  <-- nr can be 0 only if submit_extent_page
			    returns an error

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Filipe Manana
536870071d Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc
In the fixup worker, if we fail to mark the range as delalloc in the io
tree, we must release the previously reserved metadata, as well as update
the outstanding extents counter for the inode, otherwise we leak metadata
space.

In pratice we can't return an error from btrfs_set_extent_delalloc(),
which is just a wrapper around __set_extent_bit(), as for most errors
__set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
well) and returning an -EEXIST error doesn't happen in this case since
the exclusive bits parameter always has a value of 0 through this code
path. Nevertheless, just fix the error handling in the fixup worker,
in case one day __set_extent_bit() can return an error to this code
path.

Fixes: f3038ee3a3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Filipe Manana
a0e248bb50 Btrfs: fix negative subv_writers counter and data space leak after buffered write
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:

1) We fail to allocate data space for the write, since there's not
   enough available data space nor enough unallocated space for allocating
   a new data block group;

2) Because of that failure, we try to go to NOCOW mode, which succeeds
   and therefore we set the local variable 'only_release_metadata' to true
   and set the root's sub_writers counter to 1 through the call to
   btrfs_start_write_no_snapshotting() made by check_can_nocow();

3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
   to happen but not impossible;

4) No pages are copied because btrfs_copy_from_user() returned zero;

5) We call btrfs_end_write_no_snapshotting() which decrements the root's
   subv_writers counter to 0;

6) We don't set 'only_release_metadata' back to 'false' because we do
   it only if 'copied', the value returned by btrfs_copy_from_user(), is
   greater than zero;

7) On the next iteration of the while loop, which processes the same
   page range, we are now able to allocate data space for the write (we
   got enough data space released in the meanwhile);

8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
   now there isn't enough free metadata space, or in some other place
   further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
   btrfs_dirty_pages()), we break out of the while loop with
   'only_release_metadata' having a value of 'true';

9) Because 'only_release_metadata' is 'true' we end up decrementing the
   root's subv_writers counter to -1 (through a call to
   btrfs_end_write_no_snapshotting()), and we also end up not releasing the
   data space previously reserved through btrfs_check_data_free_space().
   As a consequence the mechanism for synchronizing NOCOW buffered writes
   with snapshotting gets broken.

Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.

Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Marcos Paulo de Souza
b929c1d831 btrfs: ioctl: Try to use btrfs_fs_info instead of *file
Some functions are doing some unnecessary indirection to reach the
btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
struct instead of a *file.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Anand Jain
4273eaff9b btrfs: use bool argument in free_root_pointers()
We don't need int argument bool shall do in free_root_pointers().  And
rename the argument as it confused two people.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Chengguang Xu
ce96b7ffd1 btrfs: use better definition of number of compression type
The compression type upper limit constant is the same as the last value
and this is confusing.  In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Chengguang Xu
b9b1a53e18 btrfs: use enum for extent type defines
Use enum to replace macro definitions of extent types.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Chengguang Xu
b2cd295964 btrfs: props: remove unnecessary hash_init()
DEFINE_HASHTABLE itself has already included initialization code,
we don't have to call hash_init() again, so remove it.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Nikolay Borisov
8d510121bf btrfs: Rename btrfs_join_transaction_nolock
This function is used only during the final phase of freespace cache
writeout. This is necessary since using the plain btrfs_join_transaction
api is deadlock prone. The deadlock looks like:

T1:
btrfs_commit_transaction
  commit_cowonly_roots
    btrfs_write_dirty_block_groups
      btrfs_wait_cache_io
        __btrfs_wait_cache_io
       btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
                                    inode and blocks transaction commit
				    until freespace cache writeout

T2: <-- after T1 has triggered the writeout
finish_ordered_fn
  btrfs_finish_ordered_io
    btrfs_join_transaction <--- this would block waiting for current
                                transaction to commit, but since trans
				commit is waiting for this writeout to
				finish

The special purpose functions prevents it by simply skipping the "wait
for writeout" since it's guaranteed the transaction won't proceed until
we are done.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Nikolay Borisov
ce6d3eb6fd btrfs: User assert to document transaction requirement
Using an ASSERT in btrfs_pin_extent allows to more stringently observe
whether the function is called under a transaction or not.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
David Sterba
67439dadb0 btrfs: opencode extent_buffer_get
The helper is trivial and we can understand what the atomic_inc on
something named refs does.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Tejun Heo
f7bddf1e27 btrfs: Avoid getting stuck during cyclic writebacks
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over.  However,
instead of current index + 1, it gets to to the current index itself.

Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.

1. There is a single file which has accumulated enough dirty pages to
   trigger balance_dirty_pages() and the writer appending to the file
   with a series of short writes.

2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.

3. Writeback kicks in and the cursor is on the last page of the dirty
   file.  Writeback is started or skipped if already in progress.  As
   it's EOF, extent_write_cache_pages() returns and the cursor is set
   to done_index which is pointing to the last page.

4. Writeback is done.  Nothing happens till balance_dirty_pages
   finishes, at which point we go back to #1.

This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system.  We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.

Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed.  However, this concern seems bogus.

On each page, the code either:

* Skips and moves onto the next page.

* Fails issue and sets done_index to index + 1.

* Successfully issues and continue to the next page if budget allows
  and not EOF.

IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page.  Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related.  Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.

Fix it by always setting done_index past the current page being
processed.

Note that this problem exists in other writepages too.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Marcos Paulo de Souza
a9143bd31c btrfs: block-group: Rework documentation of check_system_chunk function
Commit 4617ea3a52 (" Btrfs: fix necessary chunk tree space calculation
when allocating a chunk") removed the is_allocation argument from
check_system_chunk, since the formula for reserving the necessary space
for allocation or removing a chunk would be the same.

So, rework the comment by removing the mention of is_allocation
argument.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Qu Wenruo
c06631b0d8 btrfs: Enhance error output for write time tree checker
Unlike read time tree checker errors, write time error can't be
inspected by "btrfs inspect dump-tree", so we need extra information to
determine what's going wrong.

The patch will add the following output for write time tree checker
error:

- The content of the offending tree block
  To help determining if it's a false alert.

- Kernel WARN_ON() for debug build
  This is helpful for us to detect unexpected write time tree checker
  error, especially fstests could catch the dmesg.
  Since the WARN_ON() is only triggered for write time tree checker,
  test cases utilizing dm-error won't trigger this WARN_ON(), thus no
  extra noise.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Qu Wenruo
80d7fd1e09 btrfs: tree-checker: Refactor prev_key check for ino into a function
Refactor the check for prev_key->objectid of the following key types
into one function, check_prev_ino():

- EXTENT_DATA
- INODE_REF
- DIR_INDEX
- DIR_ITEM
- XATTR_ITEM

Also add the check of prev_key for INODE_REF.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason
dbb70becde Btrfs: extent_write_locked_range() should attach inode->i_wb
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression.  It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason
ec39f7696c Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.

This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.

For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.

For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason
1d53c9e672 Btrfs: only associate the locked page with one async_chunk struct
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing.  When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.

writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().

The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.

But, this is racey.  If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:

 compress_file_range(chunkA)
 submit_compress_extents(chunkA)
 submit compressed bios(chunkA)
 put_page(locked_page)

				 compress_file_range(chunkB)
				 ...

Or:

 async_cow_submit
  submit_compressed_extents <--- falls back to buffered writeout
   cow_file_range
    extent_clear_unlock_delalloc
     __process_pages_contig
       put_page(locked_pages)

					    async_cow_submit

The end result is that chunkA is completed and cleaned up before chunkB
even starts processing.  This means we can free locked_page() and reuse
it elsewhere.  If we get really lucky, it'll have the same page->index
in its new home as it did before.

While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.

Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:

in submit_compressed_extents():
  while (async extents) {
  ...
    cow_file_range
    if (!page_started ...)
      extent_write_locked_range
    else if (...)
      unlock_page
    continue;

This hasn't been observed in practice but is still possible.

With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.

  BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
  IP: percpu_counter_add_batch+0x11/0x70
  PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
  CPU: 16 PID: 2172 Comm: rm Not tainted
  RIP: 0010:percpu_counter_add_batch+0x11/0x70
  RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
  RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
  RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
  RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
  R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
  R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
  FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   account_page_cleaned+0x15b/0x1f0
   __cancel_dirty_page+0x146/0x200
   truncate_cleanup_page+0x92/0xb0
   truncate_inode_pages_range+0x202/0x7d0
   btrfs_evict_inode+0x92/0x5a0
   evict+0xc1/0x190
   do_unlinkat+0x176/0x280
   do_syscall_64+0x63/0x1a0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.

Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason
ba8a9d0795 Btrfs: delete the entire async bio submission framework
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason
08635bae0b Btrfs: stop using btrfs_schedule_bio()
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call.  This has been used to make sure async crc and
compression helpers don't get stuck on IO submission.  To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.

All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
using btrfs_schedule_bio().  We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
e1f60a6580 btrfs: add __pure attribute to functions
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .

The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
4143cb8b6f btrfs: add const function attribute
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
b105e92755 btrfs: add __cold attribute to more functions
The attribute can mark functions supposed to be called rarely if at all
and the text can be moved to sections far from the other code. The
attribute has been added to several functions already, this patch is
based on hints given by gcc -Wsuggest-attribute=cold.

The net effect of this patch is decrease of btrfs.ko by 1000-1300,
depending on the config options.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
4c66e0d424 btrfs: drop unused parameter is_new from btrfs_iget
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
Josef Bacik
baf320b9d5 btrfs: use refcount_inc_not_zero in kill_all_nodes
We hit the following warning while running down a different problem

[ 6197.175850] ------------[ cut here ]------------
[ 6197.185082] refcount_t: underflow; use-after-free.
[ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
[ 6197.521792] Call Trace:
[ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
[ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
[ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
[ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
[ 6197.566910]  cleaner_kthread+0xfa/0x120
[ 6197.574573]  kthread+0x111/0x130
[ 6197.581022]  ? kthread_create_on_node+0x60/0x60
[ 6197.590086]  ret_from_fork+0x1f/0x30
[ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---

This is because the free side drops the ref without the lock, and then
takes the lock if our refcount is 0.  So you can have nodes on the tree
that have a refcount of 0.  Fix this by zero'ing out that element in our
temporary array so we don't try to kill it again.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:51 +01:00
Anand Jain
aa6c0df73e btrfs: print process name and pid that calls device scanning
Its very helpful if we had logged the device scanner process name to
debug the race condition between the systemd-udevd scan and the user
initiated device forget command.

This patch adds process name and pid to the scan message.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add pid to the message ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:51 +01:00
Nikolay Borisov
725af92a62 btrfs: Open-code name_in_log_ref in replay_one_name
That function adds unnecessary indirection between backref_in_log and
the caller. Furthermore it also "downgrades" backref_in_log's return
value to a boolean, when in fact it could very well be an error.

Rectify the situation by simply opencoding name_in_log_ref in
replay_one_name and properly handling possible return codes from
backref_in_log.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:51 +01:00
Nikolay Borisov
d3316c8233 btrfs: Properly handle backref_in_log retval
This function can return a negative error value if btrfs_search_slot
errors for whatever reason or if btrfs_alloc_path runs out of memory.
This is currently problemattic because backref_in_log is treated by its
callers as if it returns boolean.

Fix this by adding proper error handling in callers. That also enables
the function to return the direct error code from btrfs_search_slot.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:51 +01:00
Nikolay Borisov
89cbf5f6b6 btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_log
Direct replacement, though note that the inside of the loop in
btrfs_find_name_in_backref is organized in a slightly different way but
is equvalent.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:51 +01:00
Qu Wenruo
3296bf5624 btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED
The state was introduced in commit 4a9d8bdee3 ("Btrfs: make the state
of the transaction more readable"), then in commit 302167c50b
("btrfs: don't end the transaction for delayed refs in throttle") the
state is completely removed.

So we can just clean up the state since it's only compared but never
set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:50 +01:00
Qu Wenruo
61c047b541 btrfs: transaction: describe transaction states and transitions
Add an overview of the basic btrfs transaction transitions, including
the following states:

- No transaction states
- Transaction N [[TRANS_STATE_RUNNING]]
- Transaction N [[TRANS_STATE_COMMIT_START]]
- Transaction N [[TRANS_STATE_COMMIT_DOING]]
- Transaction N [[TRANS_STATE_UNBLOCKED]]
- Transaction N [[TRANS_STATE_COMPLETED]]

For each state, the comment will include:

- Basic explaination about current state
- How to go next stage
- What will happen if we call various start_transaction() functions
- Relationship to transaction N+1

This doesn't provide tech details, but serves as a cheat sheet for
reader to get into the code a little easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:50 +01:00
David Sterba
c1499166d1 btrfs: use has_single_bit_set for clarity
Replace is_power_of_2 with the helper that is self-documenting and
remove the open coded call in alloc_profile_is_valid.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:50 +01:00
David Sterba
79c8264e44 btrfs: add 64bit safe helper for power of two checks
As is_power_of_two takes unsigned long, it's not safe on 32bit
architectures, but we could pass any u64 value in seveal places. Add a
separate helper and also an alias that better expresses the purpose for
which the helper is used.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:50 +01:00
Anand Jain
e62869be1e btrfs: balance: use term redundancy instead of integrity in message
When balance reduces the number of copies of metadata, it reduces the
redundancy, use the term redundancy instead of integrity.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:50 +01:00
David Sterba
1f95ec012c btrfs: move btrfs_unlock_up_safe to other locking functions
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
David Sterba
ed2b1d36a9 btrfs: move btrfs_set_path_blocking to other locking functions
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
David Sterba
31f6e769ce btrfs: make btrfs_assert_tree_locked static inline
The function btrfs_assert_tree_locked is used outside of the locking
code so it is exported, however we can make it static inine as it's
fairly trivial.

This is the only locking assertion used in release builds, inlining
improves the text size by 174 bytes and reduces stack consumption in the
callers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
David Sterba
d6156218be btrfs: make locking assertion helpers static inline
I've noticed that none of the btrfs_assert_*lock* debugging helpers is
inlined, despite they're short and mostly a value update. Making them
inline shaves 67 from the text size, reduces stack consumption and
perhaps also slightly improves the performance due to avoiding
unnecessary calls.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
Omar Sandoval
c9eb55db84 btrfs: get rid of pointless wtag variable in async-thread.c
Commit ac0c7cf8be ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:

1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
   stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
   doesn't actually make things any more type-safe. (Note that the
   original bug in commit bc074524e1 ("btrfs: prefix fsid to all trace
   events") was that the void * was implicitly casted when it was passed
   to btrfs_work_owner() in the trace point itself).

Instead, let's add some clearer warnings as comments.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:49 +01:00
Omar Sandoval
a0cac0ec96 btrfs: get rid of unique workqueue helper functions
Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.

However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.

While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
57d4f0b863 btrfs: don't prematurely free work in scrub_missing_raid56_worker()
Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
e732fe95e4 btrfs: don't prematurely free work in reada_start_machine_worker()
Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__read_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).

Either way, let's fix it by freeing the work at the end.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
9be490f1e1 btrfs: don't prematurely free work in end_workqueue_fn()
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.

This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
c495dcd6fb btrfs: don't prematurely free work in run_ordered_work()
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:48 +01:00
Omar Sandoval
cdc6f1668e btrfs: get rid of unnecessary memset() of work item
Commit fc97fab0ea ("btrfs: Replace fs_info->qgroup_rescan_worker
workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be
initialized with btrfs_init_work(), but it left behind an unnecessary
memset(). Get rid of the memset().

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
b3f167aa6c btrfs: move the failrec tree stuff into extent-io-tree.h
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
083e75e7e6 btrfs: export find_delalloc_range
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
9c7d3a5483 btrfs: move extent_io_tree defs to their own header
extent_io.c/h are huge, encompassing a bunch of different things.  The
extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
6f0d04f8e7 btrfs: separate out the extent io init function
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
33ca832fef btrfs: separate out the extent leak code
We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
34ffafdba1 btrfs: ctree: Remove stray comment of setting up path lock
The following comment shows up in btrfs_search_slot() with out much
sense:

	/*
	 * setup the path here so we can release it under lock
	 * contention with the cow code
	 */
	if (cow) {
		/* code touching path->lock[] is far away from here */
	}

This comment hasn't been cleaned up after the relevant code has been
removed.

The original code is introduced in commit 65b51a009e
("btrfs_search_slot: reduce lock contention by cowing in two stages"):

  +
  +               /*
  +                * setup the path here so we can release it under lock
  +                * contention with the cow code
  +                */
  +               p->nodes[level] = b;
  +               if (!p->skip_locking)
  +                       p->locks[level] = 1;
  +

But in current code, we have different timing for modifying path lock,
so just remove the comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
abe9339d69 btrfs: ctree: Reduce one indent level for btrfs_search_old_slot()
Similar to btrfs_search_slot() done in previous patch, make a shortcut
for the level 0 case and allow to reduce indentation for the remaining
case.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
f624d97608 btrfs: ctree: Reduce one indent level for btrfs_search_slot()
In btrfs_search_slot(), we something like:

	if (level != 0) {
		/* Do search inside tree nodes*/
	} else {
		/* Do search inside tree leaves */
		goto done;
	}

This caused extra indent for tree node search code.  Change it to
something like:

	if (level == 0) {
		/* Do search inside tree leaves */
		goto done'
	}
	/* Do search inside tree nodes */

So we have more space to maneuver our code, this is especially useful as
the tree nodes search code is more complex than the leaves search code.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
71bf92a9b8 btrfs: tree-checker: Add check for INODE_REF
For INODE_REF we will check:
- Objectid (ino) against previous key
  To detect missing INODE_ITEM.

- No overflow/padding in the data payload
  Much like DIR_ITEM, but with less members to check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Qu Wenruo
c18679ebd8 btrfs: tree-checker: Try to detect missing INODE_ITEM
For the following items, key->objectid is inode number:
- DIR_ITEM
- DIR_INDEX
- XATTR_ITEM
- EXTENT_DATA
- INODE_REF

So in the subvolume tree, such items must have its previous item share the
same objectid, e.g.:

 (257 INODE_ITEM 0)
 (257 DIR_INDEX xxx)
 (257 DIR_ITEM xxx)
 (258 INODE_ITEM 0)
 (258 INODE_REF 0)
 (258 XATTR_ITEM 0)
 (258 EXTENT_DATA 0)

But if we have the following sequence, then there is definitely
something wrong, normally some INODE_ITEM is missing, like:

 (257 INODE_ITEM 0)
 (257 DIR_INDEX xxx)
 (257 DIR_ITEM xxx)
 (258 XATTR_ITEM 0)  <<< objecitd suddenly changed to 258
 (258 EXTENT_DATA 0)

So just by checking the previous key for above inode based key types, we
can detect a missing inode item.

For INODE_REF key type, the check will be added along with INODE_REF
checker.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:46 +01:00
Filipe Manana
b9fae2ebee Btrfs: make btrfs_wait_extents() static
It's not used ouside of transaction.c

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Nikolay Borisov
35b814f3c5 btrfs: Add assert to catch nested transaction commit
A recent patch to btrfs showed that there was at least 1 case where a
nested transaction was committed. Nested transaction in this case means
a code which has a transaction handle calls some function which in turn
obtains a copy of the same transaction handle. In such cases the correct
thing to do is for the lower callee to call btrfs_end_transaction which
contains appropriate checks so as to not commit the transaction which
will result in stale trans handler for the caller.

To catch such cases add an assert in btrfs_commit_transaction ensuring
btrfs_trans_handle::use_count is always 1.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Goldwyn Rodrigues
9cf35f6735 btrfs: simplify inode locking for RWF_NOWAIT
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark.  So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.

Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:45 +01:00
Chuck Lever
21f86d2d63 NFS4: Trace lock reclaims
One of the most frustrating messages our sustaining team sees is
the "Lock reclaim failed!" message. Add some observability in the
client's lock reclaim logic so we can capture better data the
first time a problem occurs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 11:04:32 +01:00
Chuck Lever
511ba52e4c NFS4: Trace state recovery operation
Add a trace point in the main state manager loop to observe state
recovery operation. Help track down state recovery bugs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:58:39 +01:00
Olga Kornievskaia
f751c54525 NFSv4.2 fix memory leak in nfs42_ssc_open
Static analysis with Coverity detected a memory leak

Reported-by: Colin King <colin.king@canonical.com>
Fixes: ec4b092508 ("NFS: inter ssc open")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:50:41 +01:00
Olga Kornievskaia
66588abe2d NFSv4.2 fix kfree in __nfs42_copy_file_range
This is triggering problems with static analysis with Coverity

Reported-by: Colin King <colin.king@netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:50:30 +01:00
YueHaibing
843aa17a35 NFS: remove duplicated include from nfs4file.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:47:39 +01:00
YueHaibing
0003010424 NFSv4: Make _nfs42_proc_copy_notify() static
Fix sparse warning:

fs/nfs/nfs42proc.c:527:5: warning:
 symbol '_nfs42_proc_copy_notify' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:47:38 +01:00
Anna Schumaker
913eca1aea NFS: Fallocate should use the nfs4_fattr_bitmap
Changing a sparse file could have an effect not only on the file size,
but also on the number of blocks used by the file in the underlying
filesystem. The server's cache_consistency_bitmap doesn't update the
SPACE_USED attribute, so let's switch to the nfs4_fattr_bitmap to catch
this update whenever we do an ALLOCATE or DEALLOCATE.

This patch fixes xfstests generic/568, which tests that fallocating an
unaligned range allocates all blocks touched by that range. Without this
patch, `stat` reports 0 bytes used immediately after the fallocate.
Adding a `sleep 5` to the test also catches the update, but it's better
to do so when we know something has changed.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:47:05 +01:00
Anna Schumaker
89658c4d04 NFS: Return -ETXTBSY when attempting to write to a swapfile
My understanding is that -EBUSY refers to the underlying device, and
that -ETXTBSY is used when attempting to access a file in use by the
kernel (like a swapfile). Changing this return code helps us pass
xfstests generic/569

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:43:24 +01:00
Saurav Girepunje
0e96322b24 fs: nfs: sysfs: Remove NULL check before kfree
Remove NULL check before kfree, NULL check is taken care
on kfree.

Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:38:04 +01:00
YueHaibing
9c91fa36b6 NFS: remove unneeded semicolon
remove unneeded semicolon.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:37:58 +01:00
Ben Dooks
d49dd11753 NFSv4: add declaration of current_stateid
The current_stateid is exported from nfs4state.c but not
declared in any of the headers. Add to nfs4_fs.h to
remove the following warning:

fs/nfs/nfs4state.c:80:20: warning: symbol 'current_stateid' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18 10:36:45 +01:00
Zhihao Cheng
6abf572621 ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
Running stress-test test_2 in mtd-utils on ubi device, sometimes we can
get following oops message:

  BUG: unable to handle page fault for address: ffffffff00000140
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 280a067 P4D 280a067 PUD 0
  Oops: 0000 [#1] SMP
  CPU: 0 PID: 60 Comm: kworker/u16:1 Kdump: loaded Not tainted 5.2.0 #13
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0
  -0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  Workqueue: writeback wb_workfn (flush-ubifs_0_0)
  RIP: 0010:rb_next_postorder+0x2e/0xb0
  Code: 80 db 03 01 48 85 ff 0f 84 97 00 00 00 48 8b 17 48 83 05 bc 80 db
  03 01 48 83 e2 fc 0f 84 82 00 00 00 48 83 05 b2 80 db 03 01 <48> 3b 7a
  10 48 89 d0 74 02 f3 c3 48 8b 52 08 48 83 05 a3 80 db 03
  RSP: 0018:ffffc90000887758 EFLAGS: 00010202
  RAX: ffff888129ae4700 RBX: ffff888138b08400 RCX: 0000000080800001
  RDX: ffffffff00000130 RSI: 0000000080800024 RDI: ffff888138b08400
  RBP: ffff888138b08400 R08: ffffea0004a6b920 R09: 0000000000000000
  R10: ffffc90000887740 R11: 0000000000000001 R12: ffff888128d48000
  R13: 0000000000000800 R14: 000000000000011e R15: 00000000000007c8
  FS:  0000000000000000(0000) GS:ffff88813ba00000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffff00000140 CR3: 000000013789d000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    destroy_old_idx+0x5d/0xa0 [ubifs]
    ubifs_tnc_start_commit+0x4fe/0x1380 [ubifs]
    do_commit+0x3eb/0x830 [ubifs]
    ubifs_run_commit+0xdc/0x1c0 [ubifs]

Above Oops are due to the slab-out-of-bounds happened in do-while of
function layout_in_gaps indirectly called by ubifs_tnc_start_commit. In
function layout_in_gaps, there is a do-while loop placing index nodes
into the gaps created by obsolete index nodes in non-empty index LEBs
until rest index nodes can totally be placed into pre-allocated empty
LEBs. @c->gap_lebs points to a memory area(integer array) which records
LEB numbers used by 'in-the-gaps' method. Whenever a fitable index LEB
is found, corresponding lnum will be incrementally written into the
memory area pointed by @c->gap_lebs. The size
((@c->lst.idx_lebs + 1) * sizeof(int)) of memory area is allocated before
do-while loop and can not be changed in the loop. But @c->lst.idx_lebs
could be increased by function ubifs_change_lp (called by
layout_leb_in_gaps->ubifs_find_dirty_idx_leb->get_idx_gc_leb) during the
loop. So, sometimes oob happens when number of cycles in do-while loop
exceeds the original value of @c->lst.idx_lebs. See detail in
https://bugzilla.kernel.org/show_bug.cgi?id=204229.
This patch fixes oob in layout_in_gaps.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:54 +01:00
Zhihao Cheng
10256f0009 ubifs: do_kill_orphans: Fix a memory leak bug
If there are more than one valid snod on the sleb->nodes list,
do_kill_orphans will malloc ino more than once without releasing
previous ino's memory. Finally, it will trigger memory leak.

Fixes: ee1438ce5d ("ubifs: Check link count of inodes when...")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:54 +01:00
Richard Weinberger
91cbf01178 Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"
This reverts commit 9163e0184b.

At the point when ubifs_fill_super() runs, we have already a reference
to the super block. So upon deactivate_locked_super() c will get
free()'ed via ->kill_sb().

Cc: Wenwen Wang <wenwen@cs.uga.edu>
Fixes: 9163e0184b ("ubifs: Fix memory leak bug in alloc_ubifs_info() error path")
Reported-by: https://twitter.com/grsecurity/status/1180609139359277056
Signed-off-by: Richard Weinberger <richard@nod.at>
Tested-by: Romain Izard <romain.izard.pro@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:53 +01:00
Ben Dooks (Codethink)
7cc7720f06 ubifs: Fix type of sup->hash_algo
The sup->hash_algo is a __le16, and whilst 0xffff is
the same in __le16 and u16, it would be better to use
cpu_to_le16() anyway (which should deal with constants)
and silence the following sparse warning:

fs/ubifs/sb.c:187:32: warning: incorrect type in assignment (different base types)
fs/ubifs/sb.c:187:32:    expected restricted __le16 [usertype] hash_algo
fs/ubifs/sb.c:187:32:    got int

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:22:53 +01:00
Ben Dooks (Codethink)
df22b5b3ec ubifs: Fixed missed le64_to_cpu() in journal
In the ubifs_jnl_write_inode() functon, it calls ubifs_iget()
with xent->inum. The xent->inum is __le64, but the ubifs_iget()
takes native cpu endian.

I think that this should be changed to passing le64_to_cpu(xent->inum)
to fix the following sparse warning:

fs/ubifs/journal.c:902:58: warning: incorrect type in argument 2 (different base types)
fs/ubifs/journal.c:902:58:    expected unsigned long inum
fs/ubifs/journal.c:902:58:    got restricted __le64 [usertype] inum

Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:21:51 +01:00
Ben Dooks (Codethink)
3cfa4412df ubifs: Force prandom result to __le32
In set_dent_cookie() the result of prandom_u32() is
assinged to an __le32 type. Make this a forced conversion
to remove the following sparse warning:

fs/ubifs/journal.c:506:30: warning: incorrect type in assignment (different base types)
fs/ubifs/journal.c:506:30:    expected restricted __le32 [usertype] cookie
fs/ubifs/journal.c:506:30:    got unsigned int

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:15:29 +01:00
Richard Weinberger
b27b281f9c ubifs: Remove obsolete TODO from dfs_file_write()
AFAICT this kind of problems are no longer possible since
debugfs gained file removal protection via
e9117a5a4b ("debugfs: implement per-file removal protection").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17 22:15:28 +01:00
Brian Foster
2a2b5932db xfs: fix attr leaf header freemap.size underflow
The leaf format xattr addition helper xfs_attr3_leaf_add_work()
adjusts the block freemap in a couple places. The first update drops
the size of the freemap that the caller had already selected to
place the xattr name/value data. Before the function returns, it
also checks whether the entries array has encroached on a freemap
range by virtue of the new entry addition. This is necessary because
the entries array grows from the start of the block (but end of the
block header) towards the end of the block while the name/value data
grows from the end of the block in the opposite direction. If the
associated freemap is already empty, however, size is zero and the
subtraction underflows the field and causes corruption.

This is reproduced rarely by generic/070. The observed behavior is
that a smaller sized freemap is aligned to the end of the entries
list, several subsequent xattr additions land in larger freemaps and
the entries list expands into the smaller freemap until it is fully
consumed and then underflows. Note that it is not otherwise a
corruption for the entries array to consume an empty freemap because
the nameval list (i.e. the firstused pointer in the xattr header)
starts beyond the end of the corrupted freemap.

Update the freemap size modification to account for the fact that
the freemap entry can be empty and thus stale.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-15 21:15:29 -08:00
Darrick J. Wong
050552cbe0 xfs: fix some memory leaks in log recovery
Fix a few places where we xlog_alloc_buffer a buffer, hit an error, and
then bail out without freeing the buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-15 21:15:29 -08:00
Linus Torvalds
b226c9e1f4 for-linus-20191115
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3O/gkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsLBD/47jITnsOf/EU1gqW8vbl+psrPYQN+p68id
 EA5L8fqF7wHg/Anxg9MApDO6noH8BvnfSGFnqxWoE5YcvT/mfj4pVciLMiNG2BwA
 hUiJCwIG8SGCn2MRbaTQpqRnMw8aoTKdJAUWwjZTl/db+X9aCv++Odn4XuAABfh2
 LxIb0ZZBF5M8CfKRHtksuCcGBftEUTrlCzSZ9dXI5tD8EpRNJw/5LDGB6w7inhcZ
 0+X7ENdSQrMKA9ImJunLPUDFejHu4fr4qJdAX67Qai0Wf2dR54eaXmTVO4d4SGcU
 UX0zpNC6bozCq+X/ICnlJkK+ECuR33xFLRIS0S7Xv2Er6n3Ul8N6cb6RRv8Q+o1h
 XG5NfpOH+Atqmdyp9zSRI2c2UVfIfmvmRVIUFM+ZXmdw5oSfUltGLdyNVnKuhzc+
 f2Y3dti96YnT35TIihKcwfqlFuaXfLfCmLYabtVylwlOJ80Sjhgea3IyvwstpJau
 uIs5X8Z5AdBuqufPj4veS3x73DeE7slGmzADcNtUeFb1K5423MJqlQUOeVeJW3x3
 85tS7aot/SoMnA1dtREvceerFP/lIa/02iqX0TYQ7BqsN5oZjQzaiuJkUfV2WNOs
 3TlNRBKF69tpX4+NXxaSm5kC0YHtHIWF0EtNliKM7Yi8WS0tVsy74pDO7otj3j1m
 s10Rr/1seA==
 =wP5w
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes that should make it into this release. This contains:

   - io_uring:
        - The timeout command assumes sequence == 0 means that we want
          one completion, but this kind of overloading is unfortunate as
          it prevents users from doing a pure time based wait. Since
          this operation was introduced in this cycle, let's correct it
          now, while we can. (me)
        - One-liner to fix an issue with dependent links and fixed
          buffer reads. The actual IO completed fine, but the link got
          severed since we stored the wrong expected value. (me)
        - Add TIMEOUT to list of opcodes that don't need a file. (Pavel)

   - rsxx missing workqueue destry calls. Old bug. (Chuhong)

   - Fix blk-iocost active list check (Jiufei)

   - Fix impossible-to-hit overflow merge condition, that still hit some
     folks very rarely (Junichi)

   - Fix bfq hang issue from 5.3. This didn't get marked for stable, but
     will go into stable post this merge (Paolo)"

* tag 'for-linus-20191115' of git://git.kernel.dk/linux-block:
  rsxx: add missed destroy_workqueue calls in remove
  iocost: check active_list of all the ancestors in iocg_activate()
  block, bfq: deschedule empty bfq_queues not referred by any process
  io_uring: ensure registered buffer import returns the IO length
  io_uring: Fix getting file for timeout
  block: check bi_size overflow before merge
  io_uring: make timeout sequence == 0 mean no sequence
2019-11-15 13:02:34 -08:00
Al Viro
2fa6b1e01a fs/namei.c: fix missing barriers when checking positivity
Pinned negative dentries can, generally, be made positive
by another thread.  Conditions that prevent that are
	* ->d_lock on dentry in question
	* parent directory held at least shared
	* nobody else could have observed the address of dentry
Most of the places working with those fall into one of those
categories; however, d_lookup() and friends need to be used
with some care.  Fortunately, there's not a lot of call sites,
and with few exceptions all of those fall under one of the
cases above.

Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
and mountpoint_last().  Another one is lookup_slow() - there
dcache lookup is done with parent held shared, but the result
is used after we'd drop the lock.  The same happens in do_last() -
the lookup (in lookup_one()) is done with parent locked, but
result is used after unlocking.

lookup_fast(), do_last() and mountpoint_last() flat-out reject
negatives.

Most of lookup_dcache() calls are made with parent locked at least
shared; the only exception is lookup_one_len_unlocked().  It might
return pinned negative, needs serious care from callers.  Fortunately,
almost nobody calls it directly anymore; all but two callers have
converted to lookup_positive_unlocked(), which rejects negatives.

lookup_slow() is called by the same lookup_one_len_unlocked() (see
above), mountpoint_last() and walk_component().  In those two negatives
are rejected.

In other words, there is a small set of places where we need to
check carefully if a pinned potentially negative dentry is, in
fact, positive.  After that check we want to be sure that both
->d_inode and type bits in ->d_flags are stable and observed.
The set consists of follow_managed() (where the rejection happens
for lookup_fast(), walk_component() and do_last()), last_mountpoint()
and lookup_positive_unlocked().

Solution:
	1) transition from negative to positive (in __d_set_inode_and_type())
stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
smp_load_acquire() and bugger off if it type bits say "negative".
That way anyone downstream of those checks has dentry know positive pinned,
with ->d_inode and type bits of ->d_flags stable and observed.

I considered splitting off d_lookup_positive(), so that the checks could
be done right there, under ->d_lock.  However, that leads to massive
duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
No matter what, autofs_d_manage()/autofs_d_automount() must live with
the possibility of pinned negative dentry passed their way, becoming
positive under them - that's the intended behaviour when lookup comes
in the middle of automount in progress, so we can't keep them out of
the area that has to deal with those, more's the pity...

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
e840093367 fix dget_parent() fastpath race
We are overoptimistic about taking the fast path there; seeing
the same value in ->d_parent after having grabbed a reference
to that parent does *not* mean that it has remained our parent
all along.

That wouldn't be a big deal (in the end it is our parent and
we have grabbed the reference we are about to return), but...
the situation with barriers is messed up.

We might have hit the following sequence:

d is a dentry of /tmp/a/b
CPU1:					CPU2:
parent = d->d_parent (i.e. dentry of /tmp/a)
					rename /tmp/a/b to /tmp/b
					rmdir /tmp/a, making its dentry negative
grab reference to parent,
end up with cached parent->d_inode (NULL)
					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
recheck d->d_parent, which is back to original
decide that everything's fine and return the reference we'd got.

The trouble is, caller (on CPU1) will observe dget_parent()
returning an apparently negative dentry.  It actually is positive,
but CPU1 has stale ->d_inode cached.

Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
we just go into the slow path in such case.  We don't wait for ->d_seq
to become even either - again, if we are racing with renames, we
can bloody well go to slow path anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
6c2d4798a8 new helper: lookup_positive_unlocked()
Most of the callers of lookup_one_len_unlocked() treat negatives are
ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
that a pinned positive dentry remains positive - it's ->d_inode is
stable, etc.; a pinned _negative_ dentry can become positive at any
point as long as you are not holding its parent at least shared.
So using lookup_one_len_unlocked() needs to be careful;
lookup_positive_unlocked() is safer and that's what the callers
end up open-coding anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
d41efb522e fs/namei.c: pull positivity check into follow_managed()
There are 4 callers; two proceed to check if result is positive and
fail with ENOENT if it isn't; one (in handle_lookup_down()) is
guaranteed to yield positive and one (in lookup_fast()) is _preceded_
by positivity check.

However, follow_managed() on a negative dentry is a (fairly cheap)
no-op on anything other than autofs.  And negative autofs dentries
are never hashed, so lookup_fast() is not going to run into one
of those.  Moreover, successful follow_managed() on a _positive_
dentry never yields a negative one (and we significantly rely upon
that in callers of lookup_fast()).

In other words, we can easily transpose the positivity check and
the call of follow_managed() in lookup_fast().  And that allows
to fold the positivity check *into* follow_managed(), simplifying
life for the code downstream of its calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Linus Torvalds
875fef493f Two fixes for the buffered reads and O_DIRECT writes serialization
patch that went into -rc1 and a fixup for a bogus warning on older
 gcc versions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3O2RMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+k6CACf0hUTyWcJaiH3WAmkpKOnZVG//Ghv
 +hDWskib0gSilW+mx8Cjsndb5rXVuE4MhZ9P1VD1MMhhfVlfTUspCPG6cIQ3B3gd
 jEVLHDALaMc/tpKwa6EbxvxQRAL5D/2Umh8aK1kVMX2U9R6KKfMiRVToHVPewSkS
 eM3HJuV0kUonnD6glHyie1iwI9iFkDgt+eTJR1hpiFx26y6TwVCH5RNNvZGr0Tcf
 KMLgwAHxIowx0SxblvbJTMf1iIoPJNiUuZyBoo0Hli9hI7S/b4XfARAkD9eud02d
 4shv7D9Js0V1tKDsfV/c8UpnOgBTGEY4AEXIN3Mm6Gk6q9pMnobWt+l8
 =kIHR
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two fixes for the buffered reads and O_DIRECT writes serialization
  patch that went into -rc1 and a fixup for a bogus warning on older gcc
  versions"

* tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client:
  rbd: silence bogus uninitialized warning in rbd_object_map_update_finish()
  ceph: increment/decrement dio counter on async requests
  ceph: take the inode lock before acquiring cap refs
2019-11-15 10:30:24 -08:00
David Howells
a28f239e29 afs: Fix race in commit bulk status fetch
When a lookup is done, the afs filesystem will perform a bulk status-fetch
operation on the requested vnode (file) plus the next 49 other vnodes from
the directory list (in AFS, directory contents are downloaded as blobs and
parsed locally).  When the results are received, it will speculatively
populate the inode cache from the extra data.

However, if the lookup races with another lookup on the same directory, but
for a different file - one that's in the 49 extra fetches, then if the bulk
status-fetch operation finishes first, it will try and update the inode
from the other lookup.

If this other inode is still in the throes of being created, however, this
will cause an assertion failure in afs_apply_status():

	BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));

on or about fs/afs/inode.c:175 because it expects data to be there already
that it can compare to.

Fix this by skipping the update if the inode is being created as the
creator will presumably set up the inode with the same information.

Fixes: 39db9815da ("afs: Fix application of the results of a inline bulk status fetch")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 10:28:02 -08:00
Bob Peterson
d99724c3c3 gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
This patch closes a timing window in which two processes compete
and overlap in the execution of do_xmote for the same glock:

             Process A                              Process B
   ------------------------------------   -----------------------------
1. Grabs gl_lockref and calls do_xmote
2.                                        Grabs gl_lockref but is blocked
3. Sets GLF_INVALIDATE_IN_PROGRESS
4. Unlocks gl_lockref
5.                                        Calls do_xmote
6. Call glops->go_sync
7. test_and_clear_bit GLF_DIRTY
8. Call gfs2_log_flush                    Call glops->go_sync
9. (slow IO, so it blocks a long time)    test_and_clear_bit GLF_DIRTY
                                          It's not dirty (step 7) returns
10.                                       Tests GLF_INVALIDATE_IN_PROGRESS
11.                                       Calls go_inval (rgrp_go_inval)
12.                                       gfs2_rgrp_relse does brelse
13.                                       truncate_inode_pages_range
14.                                       Calls lm_lock UN

In step 14 we've just told dlm to give the glock to another node
when, in fact, process A has not finished the IO and synced all
buffer_heads to disk and make sure their revokes are done.

This patch fixes the problem by changing the GLF_INVALIDATE_IN_PROGRESS
to use test_and_set_bit, and if the bit is already set, process B just
ignores it and trusts that process A will do the do_xmote in the proper
order.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15 18:21:59 +01:00
Bob Peterson
52b1cdcb7a gfs2: Abort gfs2_freeze if io error is seen
Before this patch, an io error, such as -EIO writing to the journal
would cause function gfs2_freeze to go into an infinite loop,
continuously retrying the freeze operation. But nothing ever clears
the -EIO except unmount after withdraw, which is impossible if the
freeze operation never ends (fails). Instead you get:

[ 6499.767994] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6499.773058] gfs2: fsid=dm-32.0: retrying...
[ 6500.791957] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6500.797015] gfs2: fsid=dm-32.0: retrying...

This patch adds a check for -EIO in gfs2_freeze, and if seen, it
dequeues the freeze glock, aborts the loop and returns the error.
Also, there's no need to pass the freeze holder to function
gfs2_lock_fs_check_clean since it's only called in one place and
it's a well-known superblock pointer, so this simplifies that.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15 17:57:30 +01:00
Linus Torvalds
b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
David Howells
3c0edea9b2 pipe: Remove sync on wake_ups 2019-11-15 16:22:54 +00:00
David Howells
cefa80ced5 pipe: Increase the writer-wakeup threshold to reduce context-switch count
Increase the threshold at which the reader sends a wake event to the
writers in the queue such that the queue must be half empty before the wake
is issued rather than the wake being issued when just a single slot
available.

This reduces the number of context switches in the tests significantly,
without altering the amount of work achieved.  With my pipe-bench program,
there's a 20% reduction versus an unpatched kernel.

Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8df441294d pipe: Check for ring full inside of the spinlock in pipe_write()
Make pipe_write() check to see if the ring has become full between it
taking the pipe mutex, checking the ring status and then taking the
spinlock.

This can happen if a notification is written into the pipe as that happens
without the pipe mutex.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
7e25a73f1a pipe: Remove redundant wakeup from pipe_write()
Remove a redundant wakeup from pipe_write().

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
a194dfe6e6 pipe: Rearrange sequence in pipe_write() to preallocate slot
Rearrange the sequence in pipe_write() so that the allocation of the new
buffer, the allocation of a ring slot and the attachment to the ring is
done under the pipe wait spinlock and then the lock is dropped and the
buffer can be filled.

The data copy needs to be done with the spinlock unheld and irqs enabled,
so the lock needs to be dropped first.  However, the reader can't progress
as we're holding pipe->mutex.

We also need to drop the lock as that would impact others looking at the
pipe waitqueue, such as poll(), the consumer and a future kernel message
writer.

We just abandon the preallocated slot if we get a copy error.  Future
writes may continue it and a future read will eventually recycle it.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8446487feb pipe: Conditionalise wakeup in pipe_read()
Only do a wakeup in pipe_read() if we made space in a completely full
buffer.  The producer shouldn't be waiting on pipe->wait otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
b667b86734 pipe: Advance tail pointer inside of wait spinlock in pipe_read()
Advance the pipe ring tail pointer inside of wait spinlock in pipe_read()
so that the pipe can be written into with kernel notifications from
contexts where pipe->mutex cannot be taken.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
6718b6f855 pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers:

 (1) pipe->ring_size - indicates the hard size of the pipe ring.

 (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
     userspace orchestrated events can fill.

This allows for a pipe that is both writable by the general kernel
notification facility and by userspace, allowing plenty of ring space for
notifications to be added whilst preventing userspace from being able to
pin too much unswappable kernel space.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
Arnd Bergmann
bde9e963af y2038: timerfd: Use timespec64 internally
timerfd_show() uses a 'struct itimerspec' internally, but that is
deprecated because of the time_t overflow and a conflict with the glibc
type of the same name that is now incompatible in user space.

Use a pair of timespec64 variables instead as a simple replacement.

As this removes the last use of itimerspec from the kernel, allowing the
removal of the definition from the uapi headers along with timespec and
timeval later.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
e2bb80d55d y2038: elfcore: Use __kernel_old_timeval for process times
We store elapsed time for a crashed process in struct elf_prstatus using
'timeval' structures. Once glibc starts using 64-bit time_t, this becomes
incompatible with the kernel's idea of timeval since the structure layout
no longer matches on 32-bit architectures.

This changes the definition of the elf_prstatus structure to use
__kernel_old_timeval instead, which is hardcoded to the currently used
binary layout. There is no risk of overflow in y2038 though, because
the time values are all relative times, and can store up to 68 years
of process elapsed time.

There is a risk of applications breaking at build time when they
use the new kernel headers and expect the type to be exactly 'timeval'
rather than a structure that has the same fields as before. Those
applications have to be modified to deal with 64-bit time_t anyway.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
75d319c06e y2038: syscalls: change remaining timeval to __kernel_old_timeval
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
3ca47e958a y2038: remove CONFIG_64BIT_TIME
The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:27 +01:00
yangerkun
565333a155 ext4: fix a bug in ext4_wait_for_tail_page_commit
No need to wait for any commit once the page is fully truncated.
Besides, it may confuse e.g. concurrent ext4_writepage() with the page
still be dirty (will be cleared by truncate_pagecache() in
ext4_setattr()) but buffers has been freed; and then trigger a bug
show as below:

[   26.057508] ------------[ cut here ]------------
[   26.058531] kernel BUG at fs/ext4/inode.c:2134!
...
[   26.088130] Call trace:
[   26.088695]  ext4_writepage+0x914/0xb28
[   26.089541]  writeout.isra.4+0x1b4/0x2b8
[   26.090409]  move_to_new_page+0x3b0/0x568
[   26.091338]  __unmap_and_move+0x648/0x988
[   26.092241]  unmap_and_move+0x48c/0xbb8
[   26.093096]  migrate_pages+0x220/0xb28
[   26.093945]  kernel_mbind+0x828/0xa18
[   26.094791]  __arm64_sys_mbind+0xc8/0x138
[   26.095716]  el0_svc_common+0x190/0x490
[   26.096571]  el0_svc_handler+0x60/0xd0
[   26.097423]  el0_svc+0x8/0xc

Run the procedure (generate by syzkaller) parallel with ext3.

void main()
{
	int fd, fd1, ret;
	void *addr;
	size_t length = 4096;
	int flags;
	off_t offset = 0;
	char *str = "12345";

	fd = open("a", O_RDWR | O_CREAT);
	assert(fd >= 0);

	/* Truncate to 4k */
	ret = ftruncate(fd, length);
	assert(ret == 0);

	/* Journal data mode */
	flags = 0xc00f;
	ret = ioctl(fd, _IOW('f', 2, long), &flags);
	assert(ret == 0);

	/* Truncate to 0 */
	fd1 = open("a", O_TRUNC | O_NOATIME);
	assert(fd1 >= 0);

	addr = mmap(NULL, length, PROT_WRITE | PROT_READ,
					MAP_SHARED, fd, offset);
	assert(addr != (void *)-1);

	memcpy(addr, str, 5);
	mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
}

And the bug will be triggered once we seen the below order.

reproduce1                         reproduce2

...                            |   ...
truncate to 4k                 |
change to journal data mode    |
                               |   memcpy(set page dirty)
truncate to 0:                 |
ext4_setattr:                  |
...                            |
ext4_wait_for_tail_page_commit |
                               |   mbind(trigger bug)
truncate_pagecache(clean dirty)|   ...
...                            |

mbind will call ext4_writepage() since the page still be dirty, and then
report the bug since the buffers has been free. Fix it by return
directly once offset equals to 0 which means the page has been fully
truncated.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:22:17 -05:00
Gao Xiang
5500221ea1 ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
Similar to [1] [2], bio_alloc with __GFP_DIRECT_RECLAIM flags
guarantees bio allocation under some given restrictions, as
stated in block/bio.c and fs/direct-io.c So here it's ok to
not check for NULL value from bio_alloc().

[1] https://lore.kernel.org/r/20191030035518.65477-1-gaoxiang25@huawei.com
[2] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20191031092315.139267-1-gaoxiang25@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:19:11 -05:00
Chengguang Xu
ebc11f7b1f ext4: code cleanup for get_next_id
Now the checks in ext4_get_next_id() and dquot_get_next_id()
are almost the same, so just call dquot_get_next_id() instead
of ext4_get_next_id().

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Link: https://lore.kernel.org/r/20191006103028.31299-1-cgxu519@mykernel.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 22:15:43 -05:00
Jan Kara
f4c2d372b8 ext4: fix leak of quota reservations
Commit 8fcc3a5806 ("ext4: rework reserved cluster accounting when
invalidating pages") moved freeing of delayed allocation reservations
from dirty page invalidation time to time when we evict corresponding
status extent from extent status tree. For inodes which don't have any
blocks allocated this may actually happen only in ext4_clear_blocks()
which is after we've dropped references to quota structures from the
inode. Thus reservation of quota leaked. Fix the problem by clearing
quota information from the inode only after evicting extent status tree
in ext4_clear_inode().

Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 8fcc3a5806 ("ext4: rework reserved cluster accounting when invalidating pages")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 19:25:47 -05:00
Olof Johansson
1e1a76ed9a ext4: remove unused variable warning in parse_options()
Commit c33fbe8f67 ("ext4: Enable blocksize < pagesize for
dioread_nolock") removed the only user of 'sbi' outside of the ifdef,
so it caused a new warning:

fs/ext4/super.c:2068:23: warning: unused variable 'sbi' [-Wunused-variable]

Fixes: c33fbe8f67 ("ext4: Enable blocksize < pagesize for dioread_nolock")
Signed-off-by: Olof Johansson <olof@lixom.net>
Link: https://lore.kernel.org/r/20191111022523.34256-1-olof@lixom.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2019-11-14 17:59:05 -05:00
Chandan Rajendra
196624e192 ext4: Enable encryption for subpage-sized blocks
Now that we have the code to support encryption for subpage-sized
blocks, this commit removes the conditional check in filesystem mount
code.

The commit also changes the support statement in
Documentation/filesystems/fscrypt.rst to reflect the fact that
encryption on filesystems with blocksize less than page size now works.

[EB: Tested with 'gce-xfstests -c ext4/encrypt_1k -g auto', using the
new "encrypt_1k" config I created.  All tests pass except for those that
already fail or are excluded with the encrypt or 1k configs, and 2 tests
that try to create 1023-byte symlinks which fails since encrypted
symlinks are limited to blocksize-3 bytes.  Also ran the dedicated
encryption tests using 'kvm-xfstests -c ext4/1k -g encrypt'; all pass,
including the on-disk ciphertext verification tests.]

Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191023033312.361355-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 16:40:45 -05:00
Eric Biggers
31fb992ce6 fs/buffer.c: support fscrypt in block_read_full_page()
After each filesystem block (as represented by a buffer_head) has been
read from disk by block_read_full_page(), decrypt it if needed.  The
decryption is done on the fscrypt_read_workqueue.

This is the final change needed to support ext4 encryption with
blocksize != PAGE_SIZE, and it's a fairly small change now that
CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to
decrypt individual blocks and to enqueue work on the fscrypt workqueue.

Don't try to add fs-verity support yet, as the fs/verity/ support layer
isn't ready for sub-page blocks yet.  Just add fscrypt support for now.

Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n.

Cc: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14 16:40:45 -05:00
Jens Axboe
eac406c61c io_uring: make POLL_ADD/POLL_REMOVE scale better
One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, it breaks
down somewhat for higher number of files / sockets. A test case with
30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

Change is pretty straight forward, just replace the cancel_list with
a red/black tree instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 12:09:58 -07:00
Bob Peterson
60528afa78 gfs2: Don't loop forever in gfs2_freeze if withdrawn
Before this patch, function gfs2_freeze would loop forever if the
filesystem it tries to freeze is withdrawn. That's because function
gfs2_lock_fs_check_clean tries to enqueue the glock of the journal and
the gfs2_glock returns -EIO because you can't enqueue a journaled glock
after a withdraw.

Move the check for file system withdraw inside the loop so that the loop
can end when withdraw occurs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 19:47:05 +01:00
Bob Peterson
f155f5e010 gfs2: fix infinite loop in gfs2_ail1_flush on io error
Before this patch, an IO error encountered in function gfs2_ail1_flush
would cause a deadlock: because of the io error (and its resulting
withdrawn state), buffers stopped being written to the journal.
Buffers would remain on the ail1 list, so gfs2_ail1_start_one would
return 1 to indicate dirty buffers were still on the ail1 list.
However, when function gfs2_ail1_flush got a non-zero return code,
it would goto restart to retry the writes, which meant it would never
finish, and thus the infinite loop.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 19:46:43 +01:00
Bob Peterson
eb43e660c0 gfs2: Introduce function gfs2_withdrawn
Add function gfs2_withdrawn and replace all checks for the SDF_WITHDRAWN
bit to call it. This does not change the logic or function of gfs2, and
it facilitates later improvements to the withdraw sequence.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 19:46:18 +01:00
Jeff Layton
6a81749ebe ceph: increment/decrement dio counter on async requests
Ceph can in some cases issue an async DIO request, in which case we can
end up calling ceph_end_io_direct before the I/O is actually complete.
That may allow buffered operations to proceed while DIO requests are
still in flight.

Fix this by incrementing the i_dio_count when issuing an async DIO
request, and decrement it when tearing down the aio_req.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Jeff Layton
a81bc3102b ceph: take the inode lock before acquiring cap refs
Most of the time, we (or the vfs layer) takes the inode_lock and then
acquires caps, but ceph_read_iter does the opposite, and that can lead
to a deadlock.

When there are multiple clients treading over the same data, we can end
up in a situation where a reader takes caps and then tries to acquire
the inode_lock. Another task holds the inode_lock and issues a request
to the MDS which needs to revoke the caps, but that can't happen until
the inode_lock is unwedged.

Fix this by having ceph_read_iter take the inode_lock earlier, before
attempting to acquire caps.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Link: https://tracker.ceph.com/issues/36348
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Bob Peterson
fe5e7ba11f gfs2: fix glock reference problem in gfs2_trans_remove_revoke
Commit 9287c6452d fixed a situation in which gfs2 could use a glock
after it had been freed. To do that, it temporarily added a new glock
reference by calling gfs2_glock_hold in function gfs2_add_revoke.
However, if the bd element was removed by gfs2_trans_remove_revoke, it
failed to drop the additional reference.

This patch adds logic to gfs2_trans_remove_revoke to properly drop the
additional glock reference.

Fixes: 9287c6452d ("gfs2: Fix occasional glock use-after-free")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 16:51:00 +01:00
Bob Peterson
feed98a8e5 gfs2: make gfs2_log_shutdown static
Function gfs2_log_shutdown is only called from within log.c. This
patch removes the extern declaration and makes it static.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 16:50:56 +01:00
Jens Axboe
021d1cdda3 io-wq: remove now redundant struct io_wq_nulls_list
Since we don't iterate these lists anymore after commit:

e61df66c69 ("io-wq: ensure free/busy list browsing see all items")

we don't need to retain the nulls value we use for them. That means it's
pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 08:02:19 -07:00
Christoph Hellwig
979c690d9a block: move clearing bd_invalidated into check_disk_size_change
Both callers of check_disk_size_change clear bd_invalidate directly
after the call, so move the clearing into check_disk_size_change
itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:44:01 -07:00
Christoph Hellwig
f0b870df80 block: remove (__)blkdev_reread_part as an exported API
In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>	[dasd]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:59 -07:00
Christoph Hellwig
142fe8f4bb block: fix bdev_disk_changed for non-partitioned devices
We still have to set the capacity to 0 if invalidating or call
revalidate_disk if not even if the disk has no partitions.  Fix
that by merging rescan_partitions into bdev_disk_changed and just
stubbing out blk_add_partitions and blk_drop_partitions for
non-partitioned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:53 -07:00
Christoph Hellwig
a1548b6744 block: move rescan_partitions to fs/block_dev.c
Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:21 -07:00
Christoph Hellwig
6917d06899 block: merge invalidate_partitions into rescan_partitions
A lot of the logic in invalidate_partitions and rescan_partitions is
shared.  Merge the two functions to simplify things.  There is a small
behavior change in that we now send the kevent change notice also if we
were not invalidating but no partitions were found, which seems like
the right thing to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:42:41 -07:00
Pavel Begunkov
a320e9fa1e io_uring: Fix getting file for non-fd opcodes
For timeout requests and bunch of others io_uring tries to grab a file
with specified fd, which is usually stdin/fd=0.
Update io_op_needs_file()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
9d858b2148 io_uring: introduce req_need_defer()
Makes the code easier to read.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
2f6d9b9d63 io_uring: clean up io_uring_cancel_files()
We don't use the return value anymore, drop it. Also drop the
unecessary double cancel_req value check.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Jens Axboe
e61df66c69 io-wq: ensure free/busy list browsing see all items
We have two lists for workers in io-wq, a busy and a free list. For
certain operations we want to browse all workers, and we currently do
that by browsing the two separate lists. But since these lists are RCU
protected, we can potentially miss workers if they move between the two
lists while we're browsing them.

Add a third list, all_list, that simply holds all workers. A worker is
added to that list when it starts, and removed when it exits. This makes
the worker iteration cleaner, too.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:40:57 -07:00
Darrick J. Wong
f368b29ba9 xfs: fix another missing include
Fix missing include of xfs_filestream.h in xfs_filestream.c so that we
actually check the function declarations against the definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-13 18:22:41 -08:00
Christoph Hellwig
8234532fd4 xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
Thes ioctls set DMAPI specific flags in the on-disk inode, but there is
no way to actually ever query those flags.  The only known user is
xfsrestore with the -D option, which is documented to be only useful
inside a DMAPI enviroment, which isn't supported by upstream XFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
YueHaibing
eb0d21637f xfs: remove duplicated include from xfs_dir2_data.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
Eric Sandeen
a55cefccaa xfs: remove unused structure members & simple typedefs
Remove some unused typedef'd simple types, and some unused
structure members.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:41 -08:00
Eric Sandeen
35dab307c8 xfs: remove unused typedef definitions
Remove some typdefs for type_t's that are no longer referred to
by their typedef'd types.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:40 -08:00
Pavel Reichl
1cc95e6f0d xfs: Replace function declaration by actual definition
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typo in subject line]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:40 -08:00
Pavel Reichl
d0bdfb1069 xfs: remove the xfs_qoff_logitem_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:28 -08:00
Pavel Reichl
fd8b81dbbb xfs: remove the xfs_dq_logitem_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:26 -08:00
Pavel Reichl
c072fbefe4 xfs: remove the xfs_quotainfo_t typedef
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 18:22:23 -08:00
Jens Axboe
5e559561a8 io_uring: ensure registered buffer import returns the IO length
A test case was reported where two linked reads with registered buffers
failed the second link always. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.

Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 16:15:14 -07:00
Pavel Begunkov
5683e5406e io_uring: Fix getting file for timeout
For timeout requests io_uring tries to grab a file with specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:25:57 -07:00
Jens Axboe
36c2f9223e io-wq: ensure we have a stable view of ->cur_work for cancellations
worker->cur_work is currently protected by the lock of the wqe that the
worker belongs to. When we send a signal to a worker, we need a stable
view of ->cur_work, so we need to hold that lock. But this doesn't work
so well, since we have the opposite order potentially on queueing work.
If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
the signal lock, and that sometimes needs to insert work items.

Add a specific worker lock that protects the current work item. Then we
can guarantee that the task we're sending a signal is currently
processing the exact work we think it is.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 13:51:54 -07:00
Eric Biggers
924e319416 f2fs: support STATX_ATTR_VERITY
Set the STATX_ATTR_VERITY bit when the statx() system call is used on a
verity file on f2fs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13 12:15:34 -08:00
Eric Biggers
1f60719552 ext4: support STATX_ATTR_VERITY
Set the STATX_ATTR_VERITY bit when the statx() system call is used on a
verity file on ext4.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13 12:15:34 -08:00
Linus Torvalds
afd7a71872 for-5.4-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt
 WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu
 wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB
 mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB
 ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT
 kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9
 KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4
 1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag
 SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05
 mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna
 8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9
 OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18=
 =xzXs
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix for an older bug that has started to show up during testing
  (because of an updated test for rename exchange).

  It's an in-memory corruption caused by local variable leaking out of
  the function scope"

* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log context list corruption after rename exchange operation
2019-11-13 12:06:10 -08:00
Pavel Reichl
aefe69a45d xfs: remove the xfs_disk_dquot_t and xfs_dquot_t
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix some of the comments]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Arnd Bergmann
e8777b27ca xfs: avoid time_t in user api
The ioctl definitions for XFS_IOC_SWAPEXT, XFS_IOC_FSBULKSTAT and
XFS_IOC_FSBULKSTAT_SINGLE are part of libxfs and based on time_t.

The definition for time_t differs between current kernels and coming
32-bit libc variants that define it as 64-bit. For most ioctls, that
means the kernel has to be able to handle two different command codes
based on the different structure sizes.

The same solution could be applied for XFS_IOC_SWAPEXT, but it would
not work for XFS_IOC_FSBULKSTAT and XFS_IOC_FSBULKSTAT_SINGLE because
the structure with the time_t is passed through an indirect pointer,
and the command number itself is based on struct xfs_fsop_bulkreq,
which does not differ based on time_t.

This means any solution that can be applied requires a change of the
ABI definition in the xfs_fs.h header file, as well as doing the same
change in any user application that contains a copy of this header.

The usual solution would be to define a replacement structure and
use conditional compilation for the ioctl command codes to use
one or the other, such as

 #define XFS_IOC_FSBULKSTAT_OLD _IOWR('X', 101, struct xfs_fsop_bulkreq)
 #define XFS_IOC_FSBULKSTAT_NEW _IOWR('X', 129, struct xfs_fsop_bulkreq)
 #define XFS_IOC_FSBULKSTAT ((sizeof(time_t) == sizeof(__kernel_long_t)) ? \
			     XFS_IOC_FSBULKSTAT_OLD : XFS_IOC_FSBULKSTAT_NEW)

After this, the kernel would be able to implement both
XFS_IOC_FSBULKSTAT_OLD and XFS_IOC_FSBULKSTAT_NEW handlers on
32-bit architectures with the correct ABI for either definition
of time_t.

However, as long as two observations are true, a much simpler solution
can be used:

1. xfsprogs is the only user space project that has a copy of this header
2. xfsprogs already has a replacement for all three affected ioctl commands,
   based on the xfs_bulkstat structure to pass 64-bit timestamps
   regardless of the architecture

Based on those assumptions, changing xfs_bstime to use __kernel_long_t
instead of time_t in both the kernel and in xfsprogs preserves the current
ABI for any libc definition of time_t and solves the problem of passing
64-bit timestamps to 32-bit user space.

If either of the two assumptions is invalid, more discussion is needed
for coming up with a way to fix as much of the affected user space
code as possible.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
kaixuxia
93597ae8da xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()
When target_ip exists in xfs_rename(), the xfs_dir_replace() call may
need to hold the AGF lock to allocate more blocks, and then invoking
the xfs_droplink() call to hold AGI lock to drop target_ip onto the
unlinked list, so we get the lock order AGF->AGI. This would break the
ordering constraint on AGI and AGF locking - inode allocation locks
the AGI, then can allocate a new extent for new inodes, locking the
AGF after the AGI.

In this patch we check whether the replace operation need more
blocks firstly. If so, acquire the agi lock firstly to preserve
locking order(AGI/AGF). Actually, the locking order problem only
occurs when we are locking the AGI/AGF of the same AG. For multiple
AGs the AGI lock will be released after the transaction committed.

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: reword the comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
048a35d2f0 xfs: don't reset the "inode core" in xfs_iread
We have the exact same memset in xfs_inode_alloc, which is always called
just before xfs_iread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
de7a866fd4 xfs: merge the projid fields in struct xfs_icdinode
There is no point in splitting the fields like this in an purely
in-memory structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
8d2d878db8 xfs: use a struct timespec64 for the in-core crtime
struct xfs_icdinode is purely an in-memory data structure, so don't use
a log on-disk structure for it.  This simplifies the code a bit, and
also reduces our include hell slightly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a minor indenting problem in xfs_trans_ichgtime]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
d8d11fc703 xfs: devirtualize ->m_dirnameops
Instead of causing a relatively expensive indirect call for each
hashing and comparism of a file name in a directory just use an
inline function and a simple branch on the ASCII CI bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix unused variable warning]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Christoph Hellwig
537dabcfdb xfs: remove the unused m_chsize field
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13 11:13:45 -08:00
Darrick J. Wong
a71895c5da xfs: convert open coded corruption check to use XFS_IS_CORRUPT
Convert the last of the open coded corruption check and report idioms to
use the XFS_IS_CORRUPT macro.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-13 11:08:01 -08:00
Jens Axboe
7d7230652e io_wq: add get/put_work handlers to io_wq_create()
For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.

Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 11:37:54 -07:00
Jens Axboe
15dff286d0 io_uring: check for validity of ->rings in teardown
Normally the rings are always valid, the exception is if we failed to
allocate the rings at setup time. syzbot reports this:

RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
  io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
  io_uring_create fs/io_uring.c:4600 [inline]
  io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
  __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
  __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
  __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.

Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d50f ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 09:11:36 -07:00
Trond Myklebust
5326de9e94 NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
If nfs4_delegreturn_prepare needs to wait for a layoutreturn to complete
then make sure we drop the sequence slot if we hold it.

Fixes: 1c5bd76d17 ("pNFS: Enable layoutreturn operation for return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-13 16:37:17 +01:00
Trond Myklebust
5c441544f0 NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
If the server returns a bad or dead session error, the we don't want
to update the session slot number, but just immediately schedule
recovery and allow it to proceed.

We can/should then remove handling in other places

Fixes: 3453d5708b ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-13 16:37:17 +01:00
Peter Zijlstra
cf25e24db6 time: Rename tsk->real_start_time to ->start_boottime
Since it stores CLOCK_BOOTTIME, not, as the name suggests,
CLOCK_REALTIME, let's rename ->real_start_time to ->start_bootime.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:09:49 +01:00
Christoph Hellwig
d41003513e block: rework zone reporting
Avoid the need to allocate a potentially large array of struct blk_zone
in the block layer by switching the ->report_zones method interface to
a callback model. Now the caller simply supplies a callback that is
executed on each reported zone, and private data for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:12:07 -07:00
Darrick J. Wong
f9e0370648 xfs: kill the XFS_WANT_CORRUPT_* macros
The XFS_WANT_CORRUPT_* macros conceal subtle side effects such as the
creation of local variables and redirections of the code flow.  This is
pretty ugly, so replace them with explicit XFS_IS_CORRUPT tests that
remove both of those ugly points.  The change was performed with the
following coccinelle script:

@@
expression mp, test;
identifier label;
@@

- XFS_WANT_CORRUPTED_GOTO(mp, test, label);
+ if (XFS_IS_CORRUPT(mp, !test)) { error = -EFSCORRUPTED; goto label; }

@@
expression mp, test;
@@

- XFS_WANT_CORRUPTED_RETURN(mp, test);
+ if (XFS_IS_CORRUPT(mp, !test)) return -EFSCORRUPTED;

@@
expression mp, lval, rval;
@@

- XFS_IS_CORRUPT(mp, !(lval == rval))
+ XFS_IS_CORRUPT(mp, lval != rval)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 && e2))
+ XFS_IS_CORRUPT(mp, !e1 || !e2)

@@
expression e1, e2;
@@

- !(e1 == e2)
+ e1 != e2

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 && e3 == e4) || e5 != e6
+ e1 != e2 || e3 != e4 || e5 != e6

@@
expression e1, e2, e3, e4, e5, e6;
@@

- !(e1 == e2 || (e3 <= e4 && e5 <= e6))
+ e1 != e2 && (e3 > e4 || e5 > e6)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2))
+ XFS_IS_CORRUPT(mp, e1 > e2)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 < e2))
+ XFS_IS_CORRUPT(mp, e1 >= e2)

@@
expression mp, e1;
@@

- XFS_IS_CORRUPT(mp, !!e1)
+ XFS_IS_CORRUPT(mp, e1)

@@
expression mp, e1, e2;
@@

- XFS_IS_CORRUPT(mp, !(e1 || e2))
+ XFS_IS_CORRUPT(mp, !e1 && !e2)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 == e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 != e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 <= e2) || !(e3 >= e4))
+ XFS_IS_CORRUPT(mp, e1 > e2 || e3 < e4)

@@
expression mp, e1, e2, e3, e4;
@@

- XFS_IS_CORRUPT(mp, !(e1 == e2) && !(e3 <= e4))
+ XFS_IS_CORRUPT(mp, e1 != e2 && e3 > e4)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-12 17:19:02 -08:00
Darrick J. Wong
1ec28615d2 xfs: add a XFS_IS_CORRUPT macro
Add a new macro, XFS_IS_CORRUPT, which we will use to integrate some
corruption reporting when the corruption test expression is true.  This
will be used in the next patch to remove the ugly XFS_WANT_CORRUPT*
macros.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-12 17:19:02 -08:00
Scott Mayhew
a2e2f2dc77 nfsd: v4 support requires CRYPTO_SHA256
The new nfsdcld client tracking operations use sha256 to compute hashes
of the kerberos principals, so make sure CRYPTO_SHA256 is enabled.

Fixes: 6ee95d1c89 ("nfsd: add support for upcall version 2")
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-12 14:53:26 -05:00
Scott Mayhew
18b9a895e6 nfsd: Fix cld_net->cn_tfm initialization
Don't assign an error pointer to cld_net->cn_tfm, otherwise an oops will
occur in nfsd4_remove_cld_pipe().

Also, move the initialization of cld_net->cn_tfm so that it occurs after
the check to see if nfsdcld is running.  This is necessary because
nfsd4_client_tracking_init() looks for -ETIMEDOUT to determine whether
to use the "old" nfsdcld tracking ops.

Fixes: 6ee95d1c89 ("nfsd: add support for upcall version 2")
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-12 14:53:26 -05:00
Jens Axboe
7c9e7f0fe0 io_uring: fix potential deadlock in io_poll_wake()
We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point, we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.

IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.

Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 12:26:34 -07:00
Tejun Heo
40430452fd kernfs: use 64bit inos if ino_t is 64bit
Each kernfs_node is identified with a 64bit ID.  The low 32bit is
exposed as ino and the high gen.  While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.

This patch exposes IDs directly as inos on 64bit ino archs.  The
conversion is mostly straight-forward.

* 32bit ino archs behave the same as before.  64bit ino archs now use
  the whole 64bit ID as ino and the generation number is fixed at 1.

* 64bit inos still use the same idr allocator which gurantees that the
  lower 32bits identify the current live instance uniquely and the
  high 32bits are incremented whenever the low bits wrap.  As the
  upper 32bits are no longer used as gen and we don't wanna start ino
  allocation with 33rd bit set, the initial value for highbits
  allocation is changed to 0 on 64bit ino archs.

* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
  the issuing cgroup.  Userland builds FILEID_INO32_GEN fids from
  these numbers to look up the cgroups.  To remain compatible with the
  behavior, always output (LOW32,HIGH32) which will be constructed
  back to the original 64bit ID by __kernfs_fh_to_dentry().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
33c5ac9175 kernfs: implement custom exportfs ops and fid type
The current kernfs exportfs implementation uses the generic_fh_*()
helpers and FILEID_INO32_GEN[_PARENT] which limits ino to 32bits.
Let's implement custom exportfs operations and fid type to remove the
restriction.

* FILEID_KERNFS is a single u64 value whose content is
  kernfs_node->id.  This is the only native fid type.

* For backward compatibility with blk_log_action() path which exposes
  (ino,gen) pairs which userland assembles into FILEID_INO32_GEN keys,
  combine the generic keys into 64bit IDs in the same order.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
fe0f726c9f kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
specified ino.  On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen?  In practice,
this is a distinction without a difference because generation number
starts at 1.  There are no actual IDs with 0 gen, so it can always
safely used as wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
67c0496e87 kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
880df13161 kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
kernfs node can be created in two separate steps - allocation and
activation.  This is used to make kernfs nodes visible only after the
internal states attached to the node are fully initialized.
kernfs_find_and_get_node_by_id() currently allows lookups of nodes
which aren't activated yet and thus can expose nodes are which are
still being prepped by kernfs users.

Fix it by disallowing lookups of nodes which aren't activated yet.

kernfs_find_and_get_node_by_ino()

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
b680b08171 kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
kernfs_find_and_get_node_by_ino() uses RCU protection.  It's currently
a bit buggy because it can look up a node which hasn't been activated
yet and thus may end up exposing a node that the kernfs user is still
prepping.

While it can be fixed by pushing it further in the current direction,
it's already complicated and isn't clear whether the complexity is
justified.  The main use of kernfs_find_and_get_node_by_ino() is for
exportfs operations.  They aren't super hot and all the follow-up
operations (e.g. mapping to path) use normal locking anyway.

Let's switch to a dumber locking scheme and protect the lookup with
kernfs_idr_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo
e23f568aa6 kernfs: fix ino wrap-around detection
When the 32bit ino wraps around, kernfs increments the generation
number to distinguish reused ino instances.  The wrap-around detection
tests whether the allocated ino is lower than what the cursor but the
cursor is pointing to the next ino to allocate so the condition never
triggers.

Fix it by remembering the last ino and comparing against that.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 4a3ef68aca ("kernfs: implement i_generation")
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org # v4.14+
2019-11-12 08:18:03 -08:00
Jens Axboe
960e432dfa io_uring: use correct "is IO worker" helper
Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 08:02:26 -07:00
Andreas Gruenbacher
19ebc050e4 gfs2: Remove active journal side effect from gfs2_write_log_header
Function gfs2_write_log_header can be used to write a log header into any of
the journals of a filesystem.  When used on the node's own journal,
gfs2_write_log_header advances the current position in the log
(sdp->sd_log_flush_head) as a side effect, through function gfs2_log_bmap.

This is confusing, and it also means that we can't use gfs2_log_bmap for other
journals even if they have an extent map.  So clean this mess up by not
advancing sdp->sd_log_flush_head in gfs2_write_log_header or gfs2_log_bmap
anymore and making that a responsibility of the callers instead.

This is related to commit 7c70b89695 ("gfs2: clean_journal improperly set
sd_log_flush_head").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-12 15:17:53 +01:00
YueHaibing
00929447f5 virtiofs: Fix old-style declaration
There expect the 'static' keyword to come first in a declaration, and we
get warnings like this with "make W=1":

fs/fuse/virtio_fs.c:687:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
fs/fuse/virtio_fs.c:692:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
fs/fuse/virtio_fs.c:1029:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12 11:49:04 +01:00
Miklos Szeredi
c634da718d fuse: verify nlink
When adding a new hard link, make sure that i_nlink doesn't overflow.

Fixes: ac45d61357 ("fuse: fix nlink after unlink")
Cc: <stable@vger.kernel.org> # v3.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12 11:49:04 +01:00
Miklos Szeredi
8aab336b14 fuse: verify write return
Make sure filesystem is not returning a bogus number of bytes written.

Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12 11:49:04 +01:00
Miklos Szeredi
eb59bd17d2 fuse: verify attributes
If a filesystem returns negative inode sizes, future reads on the file were
causing the cpu to spin on truncate_pagecache.

Create a helper to validate the attributes.  This now does two things:

 - check the file mode
 - check if the file size fits in i_size without overflowing

Reported-by: Arijit Banerjee <arijit@rubrik.com>
Fixes: d8a5ba4545 ("[PATCH] FUSE - core")
Cc: <stable@vger.kernel.org> # v2.6.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12 11:49:04 +01:00
Jens Axboe
93bd25bb69 io_uring: make timeout sequence == 0 mean no sequence
Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.

If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.

Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 00:18:51 -07:00
Jens Axboe
76a46e066e io_uring: fix -ENOENT issue with linked timer with short timeout
If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.

The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.

Fix this by splitting the linked timer into two parts:

1) Prep and validate the linked timer
2) Start timer

The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:22 -07:00
Jens Axboe
768134d4f4 io_uring: don't do flush cancel under inflight_lock
We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.

Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.

This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:17 -07:00
Jens Axboe
c1edbf5f08 io_uring: flag SQPOLL busy condition to userspace
Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.

If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:11 -07:00
Jens Axboe
47f467686e io_uring: make ASYNC_CANCEL work with poll and timeout
It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.

Add a helper that takes care of it, so we can user it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:05 -07:00
Jens Axboe
0ddf92e848 io_uring: provide fallback request for OOM situations
One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.

This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is if we ever return -EBUSY to
the application, the applications knows that it can wait for events and
make forward progress when events have completed. This is the important
part.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:32:55 -07:00
Jan Stancek
e9f930ac88 iomap: fix return value of iomap_dio_bio_actor on 32bit systems
Naresh reported LTP diotest4 failing for 32bit x86 and arm -next
kernels on ext4. Same problem exists in 5.4-rc7 on xfs.

The failure comes down to:
  openat(AT_FDCWD, "testdata-4.5918", O_RDWR|O_DIRECT) = 4
  mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f7b000
  read(4, 0xb7f7b000, 4096)              = 0 // expects -EFAULT

Problem is conversion at iomap_dio_bio_actor() return. Ternary
operator has a return type and an attempt is made to convert each
of operands to the type of the other. In this case "ret" (int)
is converted to type of "copied" (unsigned long). Both have size
of 4 bytes:
    size_t copied = 0;
    int ret = -14;
    long long actor_ret = copied ? copied : ret;

    On x86_64: actor_ret == -14;
    On x86   : actor_ret == 4294967282

Replace ternary operator with 2 return statements to avoid this
unwanted conversion.

Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-11 12:58:24 -08:00
Darrick J. Wong
2713fefa5d xfs: attach dquots before performing xfs_swap_extents
Make sure we attach dquots to both inodes before swapping their extents.
This was found via manual code inspection by looking for places where we
could call xfs_trans_mod_dquot without dquots attached to inodes, and
confirmed by instrumenting the kernel and running xfs/328.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Darrick J. Wong
2815a16d7f xfs: attach dquots and reserve quota blocks during unwritten conversion
In xfs_iomap_write_unwritten, we need to ensure that dquots are attached
to the inode and quota blocks reserved so that we capture in the quota
counters any blocks allocated to handle a bmbt split.  This can happen
on the first unwritten extent conversion to a preallocated sparse file
on a fresh mount.

This was found by running generic/311 with quotas enabled.  The bug
seems to have been introduced in "[XFS] rework iocore infrastructure,
remove some code and make it more" from ~2002?

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Darrick J. Wong
27d9ee577d xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock
Coverity points out that xfs_btree_islastblock doesn't check the return
value of xfs_btree_check_block.  Since the question "Does the cursor
point to the last block in this level?" only makes sense if the caller
previously performed a lookup or seek operation, the block should
already have been checked.

Therefore, check the return value in an ASSERT and turn the whole thing
into a static inline predicate.

Coverity-id: 114069
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11 12:42:52 -08:00
Filipe Manana
e6c617102c Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully log the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contextes. However we might fail to
log the new name in the destination root, in which case we fallback to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contextes. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.

The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:

  [  691.489929] ------------[ cut here ]------------
  [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
  [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
  (...)
  [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
  [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
  (...)
  [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
  [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
  [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
  [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
  [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
  [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
  [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
  [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
  [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  691.490029] Call Trace:
  [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
  [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
  [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
  [  691.490111]  ? dget_parent+0xb8/0x460
  [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
  [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
  [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
  [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
  [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
  [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
  [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
  [  691.490204]  ? __do_sys_newstat+0x88/0xd0
  [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
  [  691.490218]  ? do_fsync+0x38/0x60
  [  691.490220]  do_fsync+0x38/0x60
  [  691.490224]  __x64_sys_fdatasync+0x32/0x40
  [  691.490228]  do_syscall_64+0x9f/0x540
  [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [  691.490235] RIP: 0033:0x7f4b253ad5f0
  (...)
  [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
  [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
  [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
  [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
  [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
  [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0

This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.

So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.

Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-11 19:46:02 +01:00
Al Viro
581ae686f2 race in exportfs_decode_fh()
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:

> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution.  Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...

Oh, lovely - in exportfs_decode_fh() we have this:
                err = exportfs_get_name(mnt, target_dir, nbuf, result);
                if (!err) {
                        inode_lock(target_dir->d_inode);
                        nresult = lookup_one_len(nbuf, target_dir,
                                                 strlen(nbuf));
                        inode_unlock(target_dir->d_inode);
                        if (!IS_ERR(nresult)) {
                                if (nresult->d_inode) {
                                        dput(result);
                                        result = nresult;
                                } else
                                        dput(nresult);
                        }
                }
We have derived the parent from fhandle, we have a disconnected dentry for child,
we go look for the name.  We even find it.  Now, we want to look it up.  And
some bastard goes and unlinks it, just as we are trying to lock the parent.
We do a lookup, and get a negative dentry.  Then we unlock the parent... and
some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
is not NULL (anymore).  It has fuck-all to do with the original fhandle
(different inumber, etc.) but we happily accept it.

Even better, we have no barriers between our check and nresult becoming positive.
IOW, having observed non-NULL ->d_inode doesn't give us enough - e.g. we might
still see the old ->d_flags value, from back when ->d_inode used to be NULL.
On something like alpha we also have no promises that we'll observe anything
about the fields of nresult->d_inode, but ->d_flags alone is enough for fun.
The callers can't e.g. expect d_is_reg() et.al. to match the reality.

This is obviously bogus.  And the fix is obvious: check that nresult->d_inode is
equal to result->d_inode before unlocking the parent.  Note that we'd *already* had
the original result and all of its aliases rejected by the 'acceptable' predicate,
so if nresult doesn't supply us a better alias, we are SOL.

Does anyone see objections to the following patch?  Christoph, that seems to
be your code; am I missing something subtle here?  AFAICS, that goes back to
2007 or so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-11 09:21:59 -05:00
Konstantin Khlebnikov
6fcbcec9cf fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
Quota statistics counted as 64-bit per-cpu counter. Reading sums per-cpu
fractions as signed 64-bit int, filters negative values and then reports
lower half as signed 32-bit int.

Result may looks like:

fs.quota.allocated_dquots = 22327
fs.quota.cache_hits = -489852115
fs.quota.drops = -487288718
fs.quota.free_dquots = 22083
fs.quota.lookups = -486883485
fs.quota.reads = 22327
fs.quota.syncs = 335064
fs.quota.writes = 3088689

Values bigger than 2^31-1 reported as negative.

All counters except "allocated_dquots" and "free_dquots" are monotonic,
thus they should be reported as is without filtering negative values.

Kernel doesn't have generic helper for 64-bit sysctl yet,
let's use at least unsigned long.

Link: https://lore.kernel.org/r/157337934693.2078.9842146413181153727.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-11 11:06:27 +01:00
Ingo Molnar
6d5a763c30 Linux 5.4-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3IqJQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOiUH+gOEDwid5OODaFAd
 CggXugdFIlBZefKqGVNW5sjgX8pxFWHXuEMC8iNb6QXtQZdFrI6LFf9hhUDmzQtm
 6y1LPxxEiTZjObMEsBNylb7tyzgujFHcAlp0Zro3w/HLCqmYTSP3FF46i2u6KZfL
 XhkpM4X7R7qxlfpdhlfESv/ElRGocZe6SwXfC7pcPo5flFcmkdu9ijqhNd/6CZ/h
 Nf9rTsD/wEDVUelFbgVN+LJzlaB0tsyc4Zbof07n8OsFZjhdEOop8gfM/kTBLcyY
 6bh66SfDScdsNnC/l8csbPjSZRx+i+nQs67DyhGNnsSAFgHBZdC4Tb/2mDCwhCLR
 dUvuYZc=
 =1N6F
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc7' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:34:59 +01:00
Jens Axboe
8e3cca1270 io_uring: convert accept4() -ERESTARTSYS into -EINTR
If we cancel a pending accept operating with a signal, we get
-ERESTARTSYS returned. Turn that into -EINTR for userspace, we should
not be return -ERESTARTSYS.

Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
46568e9be7 io_uring: fix error clear of ->file_table in io_sqe_files_register()
syzbot reports that when using failslab and friends, we can get a double
free in io_sqe_files_unregister():

BUG: KASAN: double-free or invalid-free in
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185

CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x197/0x210 lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
  __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3426 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3757
  io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
  io_ring_ctx_free fs/io_uring.c:3998 [inline]
  io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
  io_uring_release+0x42/0x50 fs/io_uring.c:4068
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x904/0x2e60 kernel/exit.c:817
  do_group_exit+0x135/0x360 kernel/exit.c:921
  __do_sys_exit_group kernel/exit.c:932 [inline]
  __se_sys_exit_group kernel/exit.c:930 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43f2c8
Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000

This happens if we fail allocating the file tables. For that case we do
free the file table correctly, but we forget to set it to NULL. This
means that ring teardown will see it as being non-NULL, and attempt to
free it again.

Fix this by clearing the file_table pointer if we free the table.

Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
Fixes: 65e19f54d2 ("io_uring: support for larger fixed file sets")
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
c69f8dbe24 io_uring: separate the io_free_req and io_free_req_find_next interface
Similar to the distinction between io_put_req and io_put_req_find_next,
io_free_req has been modified similarly, with no functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
ec9c02ad4c io_uring: keep io_put_req only responsible for release and put req
We already have io_put_req_find_next to find the next req of the link.
we should not use the io_put_req function to find them. They should be
functions of the same level.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
a197f664a0 io_uring: remove passed in 'ctx' function parameter ctx if possible
Many times, the core of the function is req, and req has already set
req->ctx at initialization time, so there is no need to pass in the
ctx from the caller.

Cleanup, no functional change.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
206aefde4f io_uring: reduce/pack size of io_ring_ctx
With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:

- We have to completion structs embedded, that we only use for
  quiesce of the ctx (or shutdown) and for sqthread init cases.
  That 2x32 bytes right there, let's dynamically allocate them.

- Reorder the struct a bit with an eye on cachelines, use cases,
  and holes.

With this patch, we're down to 512 bytes, or 8 cachelines.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Christoph Hellwig
23220fe260 xfs: always pass a valid hdr to xfs_dir3_leaf_check_int
Move the code for extracting the incore header to the only caller that
didn't already do that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
ae42976de7 xfs: merge xfs_dir2_data_freescan and xfs_dir2_data_freescan_int
There is no real need for xfs_dir2_data_freescan wrapper, so rename
xfs_dir2_data_freescan_int to xfs_dir2_data_freescan and let the
callers dereference the mount pointer from the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
957ee13e20 xfs: remove the now unused dir ops infrastructure
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
59b8b46505 xfs: devirtualize ->data_get_ftype and ->data_put_ftype
Replace the ->data_get_ftype and ->data_put_ftype dir ops methods with
directly called xfs_dir2_data_get_ftype and xfs_dir2_data_put_ftype
helpers that takes care of the differences between the directory format
with and without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
1848b607a9 xfs: devirtualize ->data_bestfree_p
Replace the ->data_bestfree_p dir ops method with a directly called
xfs_dir2_data_bestfree_p helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
711c7dbf5f xfs: cleanup xfs_dir2_data_entsize
Remove the XFS_DIR2_DATA_ENTSIZE and XFS_DIR3_DATA_ENTSIZE and open
code them in their only caller, which now becomes so simple that
we can turn it into an inline function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
d73e1cee8a xfs: move the dir2 data block fixed offsets to struct xfs_da_geometry
Move the data block fixed offsets towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:24 -08:00
Christoph Hellwig
7e8ae7bd1c xfs: devirtualize ->data_entry_tag_p
Replace the ->data_entry_tag_p dir ops method with a directly called
xfs_dir2_data_entry_tag_p helper that takes care of the differences
between the directory format with and without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
fdbb8c5b80 xfs: devirtualize ->data_entsize
Replace the ->data_entsize dir ops method with a directly called
xfs_dir2_data_entsize helper that takes care of the differences between
the directory format with and without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
5c072127d3 xfs: replace xfs_dir3_data_endp with xfs_dir3_data_end_offset
All the callers really want an offset into the buffer, so adopt
the helper to return that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
9eedae1089 xfs: remove the now unused ->data_entry_p method
Now that all users use the data_entry_offset field this method is
unused and can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
48a71399e7 xfs: cleanup __xfs_dir3_data_check
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
62479f5734 xfs: cleanup xfs_dir2_data_freescan_int
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
8073af5153 xfs: cleanup xfs_dir2_block_to_sf
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
4a1a8b2f5f xfs: cleanup xchk_directory_data_bestfree
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:23 -08:00
Christoph Hellwig
4c037dd5fd xfs: cleanup xchk_dir_rec
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
2f4369a862 xfs: cleanup xfs_dir2_leaf_getdents
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
263dde869b xfs: cleanup xfs_dir2_block_getdents
Use an offset as the main means for iteration, and only do pointer
arithmetics to find the data/unused entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
ee641d5af5 xfs: remove the ->data_unused_p method
Replace the two users of the ->data_unused_p dir ops method with a
direct calculation using ->data_entry_offset, and clean them up a bit.
xfs_dir2_sf_to_block already had an offset variable containing the
value of ->data_entry_offset, which we are now reusing to make it
clear that the initial freespace entry is at the same place that
we later fill in the 1 entry, and in xfs_dir3_data_init the function
is cleaned up a bit to keep the initialization of fields of a given
structure close to each other, and to avoid a local variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
da3ca0df8b xfs: remove the ->data_dot_entry_p and ->data_dotdot_entry_p methods
The only user of the ->data_dot_entry_p and ->data_dotdot_entry_p
methods is the xfs_dir2_sf_to_block function that builds block format
directorys from a short form directory.  It already uses pointer
arithmetics with a offset variable to do so for the real entries in
the directory, so switch the generation of the . and .. entries to
the same scheme, and clean up some of the later pointer arithmetics
to use bp->b_addr directly as well and avoid some casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
2eb68a5d36 xfs: remove the data_dotdot_offset field in struct xfs_dir_ops
The data_dotdot_offset value is always equal to data_entry_offset plus
the fixed size of the "." entry.  Right now calculating that fixed size
requires an indirect call, but by the end of this series it will be
an inline function that can be constant folded.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
1682310474 xfs: remove the data_dot_offset field in struct xfs_dir_ops
The data_dot_offset value is always equal to data_entry_offset given
that "." is always the first entry in the directory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
c81484e2b9 xfs: remove the unused ->data_first_entry_p method
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:22 -08:00
Christoph Hellwig
4501ed2a3a xfs: devirtualize ->sf_get_ftype and ->sf_put_ftype
Replace the ->sf_get_ftype and ->sf_put_ftype dir ops methods with
directly called xfs_dir2_sf_get_ftype and xfs_dir2_sf_put_ftype helpers
that takes care of the differences between the directory format with and
without the file type field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
93b1e96a42 xfs: devirtualize ->sf_get_ino and ->sf_put_ino
Replace the ->sf_get_ino and ->sf_put_ino dir ops methods with directly
called xfs_dir2_sf_get_ino and xfs_dir2_sf_put_ino helpers that take care
of the difference between the directory format with and without the file
type field.  Also move xfs_dir2_sf_get_parent_ino and
xfs_dir2_sf_put_parent_ino to xfs_dir2_sf.c with the rest of the
low-level short form entry handling and use XFS_MAXINUMBER istead of
opencoded constants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
50f6bb6b7a xfs: devirtualize ->sf_entsize and ->sf_nextentry
Just check for file-type enabled directories directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
84915e1bdd xfs: devirtualize ->sf_get_parent_ino and ->sf_put_parent_ino
The parent inode handling is the same for all directory format variants,
just use direct calls instead of going through a pointless indirect
call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
3d92c93b70 xfs: devirtualize ->db_to_fdb and ->db_to_fdindex
Now that the max bests value is in struct xfs_da_geometry both instances
of ->db_to_fdb and ->db_to_fdindex are identical.  Replace them with
local xfs_dir2_db_to_fdb and xfs_dir2_db_to_fdindex functions in
xfs_dir2_node.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
5893e4feb0 xfs: move the max dir2 free bests count to struct xfs_da_geometry
Move the max free bests count towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
ed1d612fbe xfs: move the dir2 free header size to struct xfs_da_geometry
Move the free header size towards our structure for dir/attr geometry
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
a84f3d5cb0 xfs: add a bests pointer to struct xfs_dir3_icfree_hdr
All but two callers of the ->free_bests_p dir operation already have a
struct xfs_dir3_icfree_hdr from a previous call to
xfs_dir2_free_hdr_from_disk at hand.  Add a pointer to the bests to
struct xfs_dir3_icfree_hdr to clean up this pattern.  To optimize this
pattern, pass the struct xfs_dir3_icfree_hdr to xfs_dir2_free_log_bests
instead of recalculating the pointer there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:21 -08:00
Christoph Hellwig
195b0a44ab xfs: make the xfs_dir3_icfree_hdr available to xfs_dir2_node_addname_int
Return the xfs_dir3_icfree_hdr used by the helpers called from
xfs_dir2_node_addname_int to the main function to prepare for the
next round of changes where we'll use the ichdr in xfs_dir3_icfree_hdr
to avoid extra operations to find the bests pointers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
200dada700 xfs: devirtualize ->free_hdr_to_disk
Replace the ->free_hdr_to_disk dir ops method with a directly called
xfs_dir2_free_hdr_to_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
5ba30919a6 xfs: devirtualize ->free_hdr_from_disk
Replace the ->free_hdr_from_disk dir ops method with a directly called
xfs_dir_free_hdr_from_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
478c7835cb xfs: move the max dir2 leaf entries count to struct xfs_da_geometry
Move the max leaf entries count towards our structure for dir/attr
geometry parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
545910bcc8 xfs: move the dir2 leaf header size to struct xfs_da_geometry
Move the leaf header size towards our structure for dir/attr geometry
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
787b0893ad xfs: add an entries pointer to struct xfs_dir3_icleaf_hdr
All callers of the ->node_tree_p dir operation already have a struct
xfs_dir3_icleaf_hdr from a previous call to xfs_da_leaf_hdr_from_disk at
hand, or just need slight changes to the calling conventions to do so.
Add a pointer to the entries to struct xfs_dir3_icleaf_hdr to clean up
this pattern.  To make this possible the xfs_dir3_leaf_log_ents function
grow a new argument to pass the xfs_dir3_icleaf_hdr that call callers
already have, and xfs_dir2_leaf_lookup_int returns the
xfs_dir3_icleaf_hdr to the callers so that they can later use it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
163fbbb356 xfs: devirtualize ->leaf_hdr_to_disk
Replace the ->leaf_hdr_to_disk dir ops method with a directly called
xfs_dir_leaf_hdr_to_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:20 -08:00
Christoph Hellwig
518425560a xfs: devirtualize ->leaf_hdr_from_disk
Replace the ->leaf_hdr_from_disk dir ops method with a directly called
xfs_dir2_leaf_hdr_from_disk helper that takes care of the differences
between the v4 and v5 on-disk format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
3b34441309 xfs: move the node header size to struct xfs_da_geometry
Move the node header size field to struct xfs_da_geometry, and remove
the now unused non-directory dir ops infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
51908ca75f xfs: add a btree entries pointer to struct xfs_da3_icnode_hdr
All but two callers of the ->node_tree_p dir operation already have a
xfs_da3_icnode_hdr from a previous call to xfs_da3_node_hdr_from_disk at
hand.  Add a pointer to the btree entries to struct xfs_da3_icnode_hdr
to clean up this pattern.  The two remaining callers now expand the
whole header as well, but that isn't very expensive and not in a super
hot path anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
e1c8af1e02 xfs: devirtualize ->node_hdr_to_disk
Replace the ->node_hdr_to_disk dir ops method with a directly called
xfs_da_node_hdr_to_disk helper that takes care of the v4 vs v5
difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
f475dc4dc7 xfs: devirtualize ->node_hdr_from_disk
Replace the ->node_hdr_from_disk dir ops method with a directly called
xfs_da_node_hdr_from_disk helper that takes care of the v4 vs v5
difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
649d9d98c6 xfs: refactor btree node scrubbing
Break up xchk_da_btree_entry and handle looking up leaf node entries
in the attr / dir callbacks, so that only the generic node handling
is left in the common core code.  Note that the checks for the crc
enabled blocks are removed, as the scrubbing code already remaps the
magic numbers earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
b16be56187 xfs: use unsigned int for all size values in struct xfs_da_geometry
None of these can ever be negative, so use unsigned types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Christoph Hellwig
a39f089a25 xfs: move incore structures out of xfs_da_format.h
Move the abstract in-memory version of various btree block headers
out of xfs_da_format.h as they aren't on-disk formats.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:19 -08:00
Dan Carpenter
7f6bcf7c29 xfs: remove a stray tab in xfs_remount_rw()
The extra tab makes the code slightly confusing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 16:54:18 -08:00
Darrick J. Wong
895e196fb6 xfs: convert EIO to EFSCORRUPTED when log contents are invalid
Convert EIO to EFSCORRUPTED in the logging code when we can determine
that the log contents are invalid.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 16:54:18 -08:00
Ritesh Harjani
4d06bfb97e ext4: Add error handling for io_end_vec struct allocation
This patch adds the error handling in case of any memory allocation
failure for io_end_vec. This was missing in original
patch series which enables dioread_nolock for blocksize < pagesize.

Fixes: c8cc88163f ("ext4: Add support for blocksize < pagesize in dioread_nolock")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191106093809.10673-1-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-10 19:52:03 -05:00
Linus Torvalds
a5871fcba4 configfs regression fix for 5.4-rc
- fix a regression from this merge window in the configfs
    symlink handling (Honggang Li)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl3IG00LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPJbBAAw57uaM2mKcUe1syqwsaliqYVGqZ7X+4sMdtwNPAI
 W01aolnmmukcWInlZBqiHFu3+iAVYJND9NvP7ZA+p6/meJfWiIrgWMEVS4X6HFpg
 qsBCP1qh8BvLxCzHC+i9tkgPMpicYjImE/pza99ZtrMHTRC7ao5I3GKbWBIdEpDv
 5NZDbt2g79Y1YfqGRo0xcY0etwpau+iN4rXivjj4qMO1o+Vt4rWlGZehhD2/r2M0
 NGeR3JpYe1MdxvyECMoDI+aWuOiDFrJ/ZfWfTuCskqwTyZ1BtKElBnqS6VFn7yxL
 XqPNwe6Sr9fz7RRZXQ1iH8O/SYct25xVyc6JQlAvcpp0emAUsj2bNAQ3Hj4W3Krw
 h1D5HKvte+CjIqZEnFlqI6GeHawdbSpJoE1VOECS1rXpYhvs5V+2e5RyT2gmj1Pp
 X58Q0xF5Ver2sF0NkhAMTxXL77L8cpRDVljveBXgUh/MzTdPcq1h/RsKXwwVBD23
 Vsmg4qKlBAWJLK1TJ4wvvmmp0N/XzcaNWm7cKCxJDBlzY8LpcVTmUJfGty96mHqb
 cRZLMNcWbtPFSkHfzkYeuWWnhhtgXBmEcTawOd3y0s+jd8wjhRr2wuhL/MSCicJG
 t0A+4/5p4s8H94OAiq9tF8yOAQpvzmrrunW+lyyOQchK4kBBM8rD+cMIRziL+u/2
 B84=
 =kgn+
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs

Pull configfs regression fix from Christoph Hellwig:
 "Fix a regression from this merge window in the configfs symlink
  handling (Honggang Li)"

* tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs:
  configfs: calculate the depth of parent item
2019-11-10 12:59:34 -08:00
Linus Torvalds
79a64063a8 small fix for an smb3 reconnect bug
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3HowUACgkQiiy9cAdy
 T1GBFwwAsUHGIXIeV5gWNuy7XhTYOkH9OsSEOJIT+N/3d/L2QrIcfQ0x4n/RauC9
 S654xqdsPljti231e7LgeV0fZlayFg99YeNGBkorGk/UBxv5pYzVqqUyyMnUhunI
 /TCC4SPzhcKyLjrei/pZ2lTkZoZe34WfAq36E1p5d758g44ypLSlIcMUUznmhOKE
 B4ATOI5wV5zqHUh4pZSkLWMQ/D33XapQZDFJweol6YEaPCz2NjQfsL/MRJ/pg3nL
 Vh7v1jdHiPMbMhqV9pqsmkBSOci+WMQTqwz4yXiKJgKRsOjrcp+64Tpl+yMDfNfk
 1AsW7A2oHxbWkmSwDN14lNMPqGftREJww9pf40BaYGzTSKUvyI7VZNULT9kLCtSQ
 0xvVhrAMtHZSFeDqIf0ndcPS5WXWlHrTRDx25fgFHid3ZkDh5tLcNgODRZMih/xy
 KvdHlKm2y36jdvDxXMoNAna7iwgKEOtYYk/X9oq82icI+6S0p46HEcShx1rqgDXL
 KAWDdnsb
 =8yoq
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "Small fix for an smb3 reconnect bug (also marked for stable)"

* tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  SMB3: Fix persistent handles reconnect
2019-11-10 11:43:18 -08:00
Darrick J. Wong
2fe4f92834 xfs: refactor "does this fork map blocks" predicate
Replace the open-coded checks for whether or not an inode fork maps
blocks with a macro that will implant the code for us.  This helps us
declutter the bmap code a bit.

Note that I had to use a macro instead of a static inline function
because of C header dependency problems between xfs_inode.h and
xfs_inode_fork.h.

Conversion was performed with the following Coccinelle script:

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE
+ !xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS
+ !xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- (xfs_ifork_has_extents(ip, w))
+ xfs_ifork_has_extents(ip, w)

@@
expression ip, w;
@@

- (!xfs_ifork_has_extents(ip, w))
+ !xfs_ifork_has_extents(ip, w)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:22:51 -08:00
Darrick J. Wong
5113f8ec37 xfs: clean up weird while loop in xfs_alloc_ag_vextent_near
Refactor the weird while loop out of existence.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:21:58 -08:00
Darrick J. Wong
f755979355 xfs: annotate functions that trip static checker locking checks
Add some lock annotations to helper functions that seem to have
unbalanced locking that confuses the static analyzers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-10 10:21:58 -08:00
Joe Perches
cf085a1b5d xfs: Correct comment tyops -> typos
Just fix the typos checkpatch notices...

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10 10:21:57 -08:00
Al Viro
762c69685f ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemtion in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure.  Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.

dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias.  So we safely go that way to get to its
underlying dentry.

Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:45 -05:00
Al Viro
e72b9dd6a5 ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive.  So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro
bcf0d9d4b7 ecryptfs: fix unlink and rmdir in face of underlying fs modifications
A problem similar to the one caught in commit 74dd7c97ea ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.

Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
        * find the underlying dentry of victim
        * find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
        * lock the inode of dentry underlying the victim's parent
        * check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive.  So while ->d_parent itself
might not be stable, the result of comparison is.

If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink().  As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.

The same problem (with the same solution) exists for rmdir.  Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.

Fixes: 74dd7c97ea "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro
a2ece08888 exportfs_decode_fh(): negative pinned may become positive without the parent locked
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:56:05 -05:00
Linus Torvalds
00aff68362 for-5.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt
 WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP
 bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M
 wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q
 T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE
 Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ
 l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr
 cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg
 GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7
 lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx
 oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0
 o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc=
 =CZxY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regressions and fixes for stable.

  Regressions:

   - fix a race leading to metadata space leak after task received a
     signal

   - un-deprecate 2 ioctls, marked as deprecated by mistake

  Fixes:

   - fix limit check for number of devices during chunk allocation

   - fix a race due to double evaluation of i_size_read inside max()
     macro, can cause a crash

   - remove wrong device id check in tree-checker"

* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
  btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
  Btrfs: fix race leading to metadata space leak after task received signal
  btrfs: tree-checker: Fix wrong check on max devid
  btrfs: Consider system chunk array size for new SYSTEM chunks
2019-11-09 08:51:37 -08:00
Linus Torvalds
5cb8418cb5 for-linus-2019-11-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3F+DIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkVmD/9FIa092Q6obga0RqA16GlbI85tgtNyRFZU
 neIO/g9R3G/uBTGbiUeXHXHDz9CqUXIRYX7pmI2u0b07iGLRz8oUsOsgyVEfCMen
 VitqwkJJAZ9j9OifyKpLYCZX9ulVDWX5hEz/vm2cNWDkjCbOpXvRuQmkXEzp7RNM
 F7K25PpGLvJHfqC90q9FXXxNDlB2i1M/rh5I7eUqhb6rHmzfJGKCd+H80t+REoB1
 iXAygPj86agQKLOUKZtJjXUjq9Ol/0FD+OKY+eP3EfVv/FJvIeWWYe78WplxJpRD
 BYb9dhLMCSo619WVVy4hNYCPjSfPKVT2cO5QJmVRpgOI1urFuTNgNoIiw06AgvkZ
 09vrlrJZ5A7eFEppuAFQC4WRYKWCQCQfq8wxt2iGUivXgHfskjJ7qJz1Mh5Nlxsr
 JGm1hSVw9UCBzjqC75K2CR+vVt4T8ovEaizPFvzVj6lQ0lRTmSchisxbXzTdevOn
 kFvBOntBdpeSq/CwZ0x5PDP4AbRsgH3ny47LCJoEpFZlxlOLuEB/dAx564dn6ZgZ
 rRz6mKU7rlO7brrkW+DbRZ0XiOn5qzN6FrcSGX1DBzc4+hMus/5PPjv8Gk+Wo1PU
 388mu6N6DObSjw1ij1AqkwyzudmbrziKM4isJY4I72I0YGSq9cuG5VXgM2GqbGlU
 XXpzsu8pSw==
 =8B/z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Two NVMe device removal crash fixes, and a compat fixup for for an
   ioctl that was introduced in this release (Anton, Charles, Max - via
   Keith)

 - Missing error path mutex unlock for drbd (Dan)

 - cgroup writeback fixup on dead memcg (Tejun)

 - blkcg online stats print fix (Tejun)

* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
  cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
  block: drbd: remove a stray unlock in __drbd_send_protocol()
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
2019-11-08 18:15:55 -08:00
Tejun Heo
65de03e251 cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead.  This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (ie. when blkcg got
disabled / enabled higher up in the hierarchy).

Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups.  When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.

While c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping
has no dirty pages") alleviated the issue significantly as it now only
triggers when the inode has dirty pages.  However, the condition can
still be triggered before the inode is switched to a different cgroup
and the logic simply doesn't make sense.

Skip the immediate switching if the associated memcg is dying.

This is a simplified version of the following two patches:

 * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
 * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz

Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 13:37:24 -07:00
Linus Torvalds
0689acfad3 Some late-breaking dentry handling fixes from Al and Jeff, a patch to
further restrict copy_file_range() to avoid potential data corruption
 from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.  Everything but
 the fscache fix is marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3FjtQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyjfCACbNiRHpTWfuZ1iHXXCpo0tAixOT6i0
 9ZlE8yiF6U6iH7uv/MuIeJ+Qeep9+6wwgJYEJKcF0e+Raiob7womDO+yeDx6RYrC
 bqq6OLjJl8VbfeQWvJTmisGTPpvOsOKxxXIKG6vx1zJaJ69yENilFZ0biis14iTu
 RRDvtSntc96OhZ3n8IXie2+f9purYeIKhiCZqx9WzYZOk5sb3zGyhqD+Wh5kxkEh
 QzwwZ/G5RYvMMhL0o13uMZrjqozGTI9Qm09u7+EikIXFFtt+szEdgFd4pQPhsC1D
 VmpdrvV4ml8JnbkENiZ7tlHCcICQJfgMWNcLVgHvD6808CJ0UP5Cq/jP
 =34Iz
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Some late-breaking dentry handling fixes from Al and Jeff, a patch to
  further restrict copy_file_range() to avoid potential data corruption
  from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.

  Everything but the fscache fix is marked for stable"

* tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client:
  ceph: return -EINVAL if given fsc mount option on kernel w/o support
  ceph: don't allow copy_file_range when stripe_count != 1
  ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
  ceph: add missing check in d_revalidate snapdir handling
  ceph: fix RCU case handling in ceph_d_revalidate()
  ceph: fix use-after-free in __ceph_remove_cap()
2019-11-08 12:31:27 -08:00
Mao Wenan
2a67803e13 nfsd: Drop LIST_HEAD where the variable it declares is never used.
The declarations were introduced with the file, but the declared
variables were not used.

Fixes: 65294c1f2c ("nfsd: add a new struct file caching facility to nfsd")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-08 12:32:59 -05:00
J. Bruce Fields
cc1ce2f13e nfsd: document callback_wq serialization of callback code
The callback code relies on the fact that much of it is only ever called
from the ordered workqueue callback_wq, and this is worth documenting.

Reported-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-08 12:32:59 -05:00
J. Bruce Fields
20428a8047 nfsd: mark cb path down on unknown errors
An unexpected error is probably a sign that something is wrong with the
callback path.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-08 12:32:59 -05:00
Trond Myklebust
2bbfed98a4 nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
When we're destroying the client lease, and we call
nfsd4_shutdown_callback(), we must ensure that we do not return
before all outstanding callbacks have terminated and have
released their payloads.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-08 12:32:59 -05:00
Trond Myklebust
12357f1b2c nfsd: minor 4.1 callback cleanup
Move all the cb_holds_slot management into helper functions.  No change
in behavior.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-08 12:32:44 -05:00
Darrick J. Wong
2b91b28e29 iomap: iomap_bmap should check iomap_apply return value
Coverity caught this fairly minor bug, but we should check the return
value of iomap_apply regardless.

Coverity-id: 1437065
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-08 07:44:12 -08:00
Darrick J. Wong
d6abecb825 xfs: range check ri_cnt when recovering log items
Range check the region counter when we're reassembling regions from log
items during log recovery.  In the old days ASSERT would halt the
kernel, but this isn't true any more so we have to make an explicit
error return.

Coverity-id: 1132508
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
120254608f xfs: "optimize" buffer item log segment bitmap setting
Optimize the setting of full words of bits in xfs_buf_item_log_segment.
The optimization is purely within the bug triage process.  No functional
changes.

Coverity-id: 1446793
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
f5be08446e xfs: null out bma->prev if no previous extent
Coverity complains that we don't check the return value of
xfs_iext_peek_prev_extent like we do nearly all of the time.  If there
is no previous extent then just null out bma->prev like we do elsewhere
in the bmap code.

Coverity-id: 1424057
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:54 -08:00
Darrick J. Wong
5f213ddbcb xfs: fix missing header includes
Some of the xfs source files are missing header includes, so add them
back.  Sparse complains about non-static functions that don't have a
forward declaration anywhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:53 -08:00
Darrick J. Wong
5d1116d4c6 xfs: periodically yield scrub threads to the scheduler
Christoph Hellwig complained about the following soft lockup warning
when running scrub after generic/175 when preemption is disabled and
slub debugging is enabled:

watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
Modules linked in:
irq event stamp: 41692326
hardirqs last  enabled at (41692325): [<ffffffff8232c3b7>] _raw_0
hardirqs last disabled at (41692326): [<ffffffff81001c5a>] trace0
softirqs last  enabled at (41684994): [<ffffffff8260031f>] __do_e
softirqs last disabled at (41684987): [<ffffffff81127d8c>] irq_e0
CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
FS:  00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
Call Trace:
 free_debug_processing+0x1dd/0x240
 __slab_free+0x231/0x410
 kmem_cache_free+0x30e/0x360
 xchk_ag_btcur_free+0x76/0xb0
 xchk_ag_free+0x10/0x80
 xchk_bmap_iextent_xref.isra.14+0xd9/0x120
 xchk_bmap_iextent+0x187/0x210
 xchk_bmap+0x2e0/0x3b0
 xfs_scrub_metadata+0x2e7/0x500
 xfs_ioc_scrub_metadata+0x4a/0xa0
 xfs_file_ioctl+0x58a/0xcd0
 do_vfs_ioctl+0xa0/0x6f0
 ksys_ioctl+0x5b/0x90
 __x64_sys_ioctl+0x11/0x20
 do_syscall_64+0x4b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

If preemption is disabled, all metadata buffers needed to perform the
scrub are already in memory, and there are a lot of records to check,
it's possible that the scrub thread will run for an extended period of
time without sleeping for IO or any other reason.  Then the watchdog
timer or the RCU stall timeout can trigger, producing the backtrace
above.

To fix this problem, call cond_resched() from the scrub thread so that
we back out to the scheduler whenever necessary.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07 13:00:53 -08:00
Andreas Gruenbacher
184b4e6085 gfs2: Fix end-of-file handling in gfs2_page_mkwrite
When the filesystem block size is smaller than the page size, the last
page may contain blocks that lie entirely beyond the end of the file.
Make sure to only allocate blocks that lie at least partially in the
file.  Allocating blocks beyond that isn't useful, and what's more, they
will not be zeroed out and may end up containing random data.

With that change in place, make sure we'll still always unstuff stuffed
inodes: iomap_writepage and iomap_writepages currently can't handle
stuffed files.

In addition, simplify and move the end-of-file check further to the top
in gfs2_page_mkwrite to avoid weird side effects like unstuffing when
we're not.

Fixes xfstest generic/263.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-07 21:02:35 +01:00
Andreas Gruenbacher
f53056c430 gfs2: Multi-block allocations in gfs2_page_mkwrite
In gfs2_page_mkwrite's gfs2_allocate_page_backing helper, try to
allocate as many blocks at once as we need.  Pass in the size of the
requested allocation.

Fixes: 35af80aef9 ("gfs2: don't use buffer_heads in gfs2_allocate_page_backing")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-07 21:02:02 +01:00
Andreas Gruenbacher
39c3a948ec gfs2: Improve mmap write vs. punch_hole consistency
When punching a hole in a file, use filemap_write_and_wait_range to
write back any dirty pages in the range of the hole.  As a side effect,
if the hole isn't page aligned, this marks unaligned pages at the
beginning and the end of the hole read-only.  This is required when the
block size is smaller than the page size: when those pages are written
to again after the hole punching, we must make sure that page_mkwrite is
called for those pages so that the page will be fully allocated and any
blocks turned into holes from the hole punching will be reallocated.
(If a page is writably mapped, page_mkwrite won't be called.)

Fixes xfstest generic/567.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-07 21:01:04 +01:00
Jens Axboe
912c0a8591 Merge branch 'for-linus' into for-5.5/block
Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.

* for-linus: (942 commits)
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
  Linux 5.4-rc5
  riscv: cleanup do_trap_break
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
  ...
2019-11-07 12:27:19 -07:00
Chao Yu
1f0d5c911b f2fs: fix potential overflow
We expect 64-bit calculation result from below statement, however
in 32-bit machine, looped left shift operation on pgoff_t type
variable may cause overflow issue, fix it by forcing type cast.

page->index << PAGE_SHIFT;

Fixes: 26de9b1171 ("f2fs: avoid unnecessary updating inode during fsync")
Fixes: 0a2aa8fbb9 ("f2fs: refactor __exchange_data_block for speed up")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 11:17:39 -08:00
Chao Yu
2a60637f06 f2fs: fix to update dir's i_pino during cross_rename
As Eric reported:

RENAME_EXCHANGE support was just added to fsstress in xfstests:

	commit 65dfd40a97b6bbbd2a22538977bab355c5bc0f06
	Author: kaixuxia <xiakaixu1987@gmail.com>
	Date:   Thu Oct 31 14:41:48 2019 +0800

	    fsstress: add EXCHANGE renameat2 support

This is causing xfstest generic/579 to fail due to fsck.f2fs reporting errors.
I'm not sure what the problem is, but it still happens even with all the
fs-verity stuff in the test commented out, so that the test just runs fsstress.

generic/579 23s ... 	[10:02:25]
[    7.745370] run fstests generic/579 at 2019-11-04 10:02:25
_check_generic_filesystem: filesystem on /dev/vdc is inconsistent
(see /results/f2fs/results-default/generic/579.full for details)
 [10:02:47]
Ran: generic/579
Failures: generic/579
Failed 1 of 1 tests
Xunit report: /results/f2fs/results-default/result.xml

Here's the contents of 579.full:

_check_generic_filesystem: filesystem on /dev/vdc is inconsistent
*** fsck.f2fs output ***
[ASSERT] (__chk_dots_dentries:1378)  --> Bad inode number[0x24] for '..', parent parent ino is [0xd10]

The root cause is that we forgot to update directory's i_pino during
cross_rename, fix it.

Fixes: 32f9bc25cb ("f2fs: support ->rename2()")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 11:15:39 -08:00
Jens Axboe
5f8fd2d3e0 io_uring: properly mark async work as bounded vs unbounded
Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:

- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:57:17 -07:00
Jens Axboe
c5def4ab84 io-wq: add support for bounded vs unbunded work
io_uring supports request types that basically have two different
lifetimes:

1) Bounded completion time. These are requests like disk reads or writes,
   which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
   have no idea how long they will take to complete. Another example is
   POLL commands.

This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:41:35 -07:00
Jens Axboe
91d666ea43 io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful()
We hold the wqe lock at this point (which is also annotated), so there's
no need to use the careful variant of list_empty().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09 11:45:32 -07:00
Jens Axboe
1d7bb1d50f io_uring: add support for backlogged CQ ring
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.

After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.

Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09 11:45:29 -07:00
Jens Axboe
78e19bbef3 io_uring: pass in io_kiocb to fill/add CQ handlers
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straight forward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:27 -07:00
Jens Axboe
84f97dc233 io_uring: make io_cqring_events() take 'ctx' as argument
The rings can be derived from the ctx, and we need the ctx there for
a future change.

No functional changes in this patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:21 -07:00
Jens Axboe
2665abfd75 io_uring: add support for linked SQE timeouts
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.

This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 19:12:40 -07:00
Jens Axboe
e977d6d34f io_uring: abstract out io_async_cancel_one() helper
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 12:31:31 -07:00
Jaegeuk Kim
f5a53edcf0 f2fs: support aligned pinned file
This patch supports 2MB-aligned pinned file, which can guarantee no GC at all
by allocating fully valid 2MB segment.

Check free segments by has_not_enough_free_secs() with large budget.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 10:40:59 -08:00
Jaegeuk Kim
bc005a4d53 f2fs: avoid kernel panic on corruption test
xfstests/generic/475 complains kernel warn/panic while testing corrupted disk.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-11-07 10:40:59 -08:00
Jeff Layton
ff29fde84d ceph: return -EINVAL if given fsc mount option on kernel w/o support
If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.

[ Drop ceph prefix -- it's provided by pr_err. ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-07 18:03:23 +01:00
Andreas Gruenbacher
add66fcbd3 iomap: Fix overflow in iomap_page_mkwrite
On architectures where loff_t is wider than pgoff_t, the expression
((page->index + 1) << PAGE_SHIFT) can overflow.  Rewrite to use the page
offset, which we already compute here anyway.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-07 07:28:18 -08:00
Ajay Joshi
6c1b1da58f block: add zone open, close and finish operations
Zoned block devices (ZBC and ZAC devices) allow an explicit control
over the condition (state) of zones. The operations allowed are:
* Open a zone: Transition to open condition to indicate that a zone will
  actively be written
* Close a zone: Transition to closed condition to release the drive
  resources used for writing to a zone
* Finish a zone: Transition an open or closed zone to the full
  condition to prevent write operations

To enable this control for in-kernel zoned block device users, define
the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
and REQ_OP_ZONE_FINISH as well as the generic function
blkdev_zone_mgmt() for submitting these operations on a range of zones.
This results in blkdev_reset_zones() removal and replacement with this
new zone magement function. Users of blkdev_reset_zones() (f2fs and
dm-zoned) are updated accordingly.

Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 06:31:48 -07:00
Chengguang Xu
355b9aae86 ext2: fix improper function comment
Just fix a improper function comment.

Link: https://lore.kernel.org/r/20191104114036.9893-5-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Chengguang Xu
44dd616133 ext2: code cleanup for ext2_try_to_allocate()
Code cleanup by removing duplicated code.

Link: https://lore.kernel.org/r/20191104114036.9893-4-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Chengguang Xu
ac3820f822 ext2: skip unnecessary operations in ext2_try_to_allocate()
Move 'repeat' tag to proper place so that we can
skip unnecessary operations in ext2_try_to_allocate().

Link: https://lore.kernel.org/r/20191104114036.9893-3-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:27 +01:00
Jan Kara
cf4eb321b3 ext2: Simplify initialization in ext2_try_to_allocate()
Somewhat simplify the logic initializing search start and end in
ext2_try_to_allocate(). No functional change.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-07 12:35:22 +01:00
Pavel Shilovsky
d243af7ab9 SMB3: Fix persistent handles reconnect
When the client hits a network reconnect, it re-opens every open
file with a create context to reconnect a persistent handle. All
create context types should be 8-bytes aligned but the padding
was missed for that one. As a result, some servers don't allow
us to reconnect handles and return an error. The problem occurs
when the problematic context is not at the end of the create
request packet. Fix this by adding a proper padding at the end
of the reconnect persistent handle context.

Cc: Stable <stable@vger.kernel.org> # 4.19.x
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-06 21:32:18 -06:00
Pavel Begunkov
267bc90442 io_uring: use inlined struct sqe_submit
req->submit is always up-to-date, use it directly

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:02 -07:00
Pavel Begunkov
50585b9a07 io_uring: Use submit info inlined into req
Stack allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.

As space for it is already allocated, fill req->submit in the first
place instead of using on-stack one. As a result:

1. sqe->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. don't need to copy in case of async
3. allows to simplify the code by not carrying it as an argument all
the way down
4. allows to reduce number of function arguments / potentially improve
spilling

The downside is that stack is most probably be cached, that's not true
for just allocated memory for a request. Another concern is cache
pollution. Though, a request would be touched and fetched along with
req->submit at some point anyway, so shouldn't be a problem.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:00 -07:00
Pavel Begunkov
196be95cd5 io_uring: allocate io_kiocb upfront
Let io_submit_sqes() to allocate io_kiocb before fetching an sqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:21:37 -07:00
Eric Biggers
0eee17e332 f2fs: add support for IV_INO_LBLK_64 encryption policies
f2fs inode numbers are stable across filesystem resizing, and f2fs inode
and file logical block numbers are always 32-bit.  So f2fs can always
support IV_INO_LBLK_64 encryption policies.  Wire up the needed
fscrypt_operations to declare support.

Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:42 -08:00
Eric Biggers
b925acb8f8 ext4: add support for IV_INO_LBLK_64 encryption policies
IV_INO_LBLK_64 encryption policies have special requirements from the
filesystem beyond those of the existing encryption policies:

- Inode numbers must never change, even if the filesystem is resized.
- Inode numbers must be <= 32 bits.
- File logical block numbers must be <= 32 bits.

ext4 has 32-bit inode and file logical block numbers.  However,
resize2fs can re-number inodes when shrinking an ext4 filesystem.

However, typically the people who would want to use this format don't
care about filesystem shrinking.  They'd be fine with a solution that
just prevents the filesystem from being shrunk.

Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that
will do exactly that.  Then wire up the fscrypt_operations to expose
this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when
this flag is set.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:42 -08:00
Eric Biggers
b103fb7653 fscrypt: add support for IV_INO_LBLK_64 policies
Inline encryption hardware compliant with the UFS v2.1 standard or with
the upcoming version of the eMMC standard has the following properties:

(1) Per I/O request, the encryption key is specified by a previously
    loaded keyslot.  There might be only a small number of keyslots.

(2) Per I/O request, the starting IV is specified by a 64-bit "data unit
    number" (DUN).  IV bits 64-127 are assumed to be 0.  The hardware
    automatically increments the DUN for each "data unit" of
    configurable size in the request, e.g. for each filesystem block.

Property (1) makes it inefficient to use the traditional fscrypt
per-file keys.  Property (2) precludes the use of the existing
DIRECT_KEY fscrypt policy flag, which needs at least 192 IV bits.

Therefore, add a new fscrypt policy flag IV_INO_LBLK_64 which causes the
encryption to modified as follows:

- The encryption keys are derived from the master key, encryption mode
  number, and filesystem UUID.

- The IVs are chosen as (inode_number << 32) | file_logical_block_num.
  For filenames encryption, file_logical_block_num is 0.

Since the file nonces aren't used in the key derivation, many files may
share the same encryption key.  This is much more efficient on the
target hardware.  Including the inode number in the IVs and mixing the
filesystem UUID into the keys ensures that data in different files is
nevertheless still encrypted differently.

Additionally, limiting the inode and block numbers to 32 bits and
placing the block number in the low bits maintains compatibility with
the 64-bit DUN convention (property (2) above).

Since this scheme assumes that inode numbers are stable (which may
preclude filesystem shrinking) and that inode and file logical block
numbers are at most 32-bit, IV_INO_LBLK_64 will only be allowed on
filesystems that meet these constraints.  These are acceptable
limitations for the cases where this format would actually be used.

Note that IV_INO_LBLK_64 is an on-disk format, not an implementation.
This patch just adds support for it using the existing filesystem layer
encryption.  A later patch will add support for inline encryption.

Reviewed-by: Paul Crowley <paulcrowley@google.com>
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:34:36 -08:00
Eric Biggers
ff73c2c016 fscrypt: avoid data race on fscrypt_mode::logged_impl_name
The access to logged_impl_name is technically a data race, which tools
like KCSAN could complain about in the future.  See:
https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Fix by using xchg(), which also ensures that only one thread does the
logging.

This also required switching from bool to int, to avoid a build error on
the RISC-V architecture which doesn't implement xchg on bytes.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-06 12:33:15 -08:00
Pavel Begunkov
e5eb6366ac io_uring: io_queue_link*() right after submit
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simplier and doesn't keep
an extra variable across the loop.

Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:11 -07:00
Pavel Begunkov
ae9428ca61 io_uring: Merge io_submit_sqes and io_ring_submit
io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.

Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:07 -07:00
Honggang Li
e2f238f7d5 configfs: calculate the depth of parent item
When create symbolic link, create_link should calculate the depth
of the parent item. However, both the first and second parameters
of configfs_get_target_path had been set to the target. Broken
symbolic link created.

$ targetcli ls /
o- / ............................................................. [...]
  o- backstores .................................................. [...]
  | o- block ...................................... [Storage Objects: 0]
  | o- fileio ..................................... [Storage Objects: 2]
  | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
  | | | o- alua ....................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
  | |   o- alua ....................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | o- pscsi ...................................... [Storage Objects: 0]
  | o- ramdisk .................................... [Storage Objects: 0]
  o- iscsi ................................................ [Targets: 0]
  o- loopback ............................................. [Targets: 0]
  o- srpt ................................................. [Targets: 2]
  | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
  | | o- acls ................................................ [ACLs: 2]
  | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
  | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
  | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | o- luns ................................................ [LUNs: 2]
  | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
  | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
  | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
  |   o- acls ................................................ [ACLs: 0]
  |   o- luns ................................................ [LUNs: 0]
  o- vhost ................................................ [Targets: 0]

Fixes: e9c03af21c ("configfs: calculate the symlink target only once")
Signed-off-by: Honggang Li <honli@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-06 18:36:01 +01:00
Shuning Zhang
e74540b285 ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.

The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

The following is a case.  The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.

  kernel BUG at fs/ocfs2/extent_map.c:475!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
  CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
  Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
  task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
  RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
  Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
  RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  ---[ end trace c8aa0c8180e869dc ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This issue can be reproduced every week in a production environment.

This issue is related to the usage mode.  If others use ocfs2 in this
mode, the kernel will panic frequently.

[akpm@linux-foundation.org: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-06 08:47:08 -08:00
Colin Ian King
0279c71fe0 xfs: remove redundant assignment to variable error
Variable error is being initialized with a value that is never read
and is being re-assigned a couple of statements later on. The
assignment is redundant and hence can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-06 08:07:46 -08:00
Darrick J. Wong
8ef34723ef xfs: add missing early termination checks to record scrubbing functions
Scrubbing directories, quotas, and fs counters all involve iterating
some collection of metadata items.  The per-item scrub functions for
these three are missing some of the components they need to be able to
check for a fatal signal and terminate early.

Per-item scrub functions need to call xchk_should_terminate to look for
fatal signals, and they need to check the scrub context's corruption
flag because there's no point in continuing a scan once we've decided
the data structure is bad.  Add both of these where missing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-06 08:07:41 -08:00
Chengguang Xu
90f3741c2b ext2: code cleanup by calling ext2_group_last_block_no()
Call common helper ext2_group_last_block_no() to
calculate group last block number.

Link: https://lore.kernel.org/r/20191104114036.9893-2-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-06 16:25:15 +01:00