Commit Graph

52772 Commits

Author SHA1 Message Date
Nikolay Borisov
d02c0e2019 btrfs: Remove root argument from cow_file_range_inline
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Liu Bo
895a72be41 Btrfs: send: fix typo in TLV_PUT
According to tlv_put()'s prototype, data and attrlen needs to be
exchanged in the macro, but seems all callers are already aware of
this misorder and are therefore not affected.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
e5b84f7a25 btrfs: Remove root argument from btrfs_log_dentry_safe
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
f882274b2d btrfs: Remove root arg from btrfs_log_inode_parent
btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe
and btrfs_log_new_name) both of which pass inode->root as the root
argument and the inode itself. Remove the redundant root argument and
get a reference to the root directly from the inode, also remove
redundant root != inode->root check from the same function. No
functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
448f3a17ac btrfs: Remove redundant comment from btrfs_search_forward
This function always sets keep_locks to 1 and saves the old value of
keep_locks which is restored at the end. So there is no way it can be
called without keep_locks being set. Remove comment imposing redundant
requirement on callers.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
738c93d42c btrfs: move btrfs_listxattr prototype to xattr.h
There's a proper header for xattr handlers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
bcadd7050a btrfs: adjust return type of btrfs_getxattr
The xattr_handler::get prototype returns int, use it. The only ssize_t
exception is the per-inode listxattr handler.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
ab0d093616 btrfs: drop extern from function declarations
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
7852781d94 btrfs: drop underscores from exported xattr functions
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
Filipe Manana
ffa7c4296e Btrfs: send, do not issue unnecessary truncate operations
When send finishes processing an inode representing a regular file, it
always issues a truncate operation for that file, even if its size did
not change or the last write sets the file size correctly. In the most
common cases, the issued write operations set the file to correct size
(either full or incremental sends) or the file size did not change (for
incremental sends), so the only case where a truncate operation is needed
is when a file size becomes smaller in the send snapshot when compared
to the parent snapshot.

By not issuing unnecessary truncate operations we reduce the stream size
and save time in the receiver. Currently truncating a file to the same
size triggers writeback of its last page (if it's dirty) and waits for it
to complete (only if the file size is not aligned with the filesystem's
sector size). This is being fixed by another patch and is independent of
this change (that patch's title is "Btrfs: skip writeback of last page
when truncating file to same size").

The following script was used to measure time spent by a receiver without
this change applied, with this change applied, and without this change and
with the truncate fix applied (the fix to not make it start and wait for
writeback to complete).

  $ cat test_send.sh
  #!/bin/bash

  SRC_DEV=/dev/sdc
  DST_DEV=/dev/sdd
  SRC_MNT=/mnt/sdc
  DST_MNT=/mnt/sdd

  mkfs.btrfs -f $SRC_DEV >/dev/null
  mkfs.btrfs -f $DST_DEV >/dev/null
  mount $SRC_DEV $SRC_MNT
  mount $DST_DEV $DST_MNT

  echo "Creating source filesystem"
  for ((t = 0; t < 10; t++)); do
      (
          for ((i = 1; i <= 20000; i++)); do
              xfs_io -f -c "pwrite -S 0xab 0 5000" \
                  $SRC_MNT/file_$i > /dev/null
          done
      ) &
     worker_pids[$t]=$!
  done
  wait ${worker_pids[@]}

  echo "Creating and sending snapshot"
  btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
  /usr/bin/time -f "send took %e seconds"    \
         btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
  /usr/bin/time -f "receive took %e seconds" \
         btrfs receive -f $SRC_MNT/send_file $DST_MNT

  umount $SRC_MNT
  umount $DST_MNT

The results, which are averages for 5 runs for each case, were the
following:

* Without this change

average receive time was 26.49 seconds
standard deviation of 2.53 seconds

* Without this change and with the truncate fix

average receive time was 12.51 seconds
standard deviation of 0.32 seconds

* With this change and without the truncate fix

average receive time was 10.02 seconds
standard deviation of 1.11 seconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
Filipe Manana
213e8c5520 Btrfs: skip writeback of last page when truncating file to same size
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.

This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).

So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.

The following script was used to quickly measure the time a receive
operation takes:

 $ cat test_send.sh
 #!/bin/bash

 SRC_DEV=/dev/sdc
 DST_DEV=/dev/sdd
 SRC_MNT=/mnt/sdc
 DST_MNT=/mnt/sdd

 mkfs.btrfs -f $SRC_DEV >/dev/null
 mkfs.btrfs -f $DST_DEV >/dev/null
 mount $SRC_DEV $SRC_MNT
 mount $DST_DEV $DST_MNT

 echo "Creating source filesystem"
 for ((t = 0; t < 10; t++)); do
     (
         for ((i = 1; i <= 20000; i++)); do
             xfs_io -f -c "pwrite -S 0xab 0 5000" \
                $SRC_MNT/file_$i > /dev/null
         done
     ) &
     worker_pids[$t]=$!
 done
 wait ${worker_pids[@]}

 echo "Creating and sending snapshot"
 btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
 /usr/bin/time -f "send took %e seconds"    \
     btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
 /usr/bin/time -f "receive took %e seconds" \
     btrfs receive -f $SRC_MNT/send_file $DST_MNT

 umount $SRC_MNT
 umount $DST_MNT

The results for 5 runs were the following:

* Without this change

average receive time was 26.49 seconds
standard deviation of 2.53 seconds

* With this change

average receive time was 12.51 seconds
standard deviation of 0.32 seconds

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Liu Bo
ed5d5f37e6 Btrfs: dev-replace: skip prealloc extents when copy nocow pages
It doens't make sense to process prealloc extents as pages will be
filled with zero when reading prealloc extents.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Anand Jain
d612ac59ef btrfs: unify types for metadata_ratio and data_chunk_allocations
We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Nikolay Borisov
de224b7c56 btrfs: Remove redundant memory barriers around dio_private error status
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:

    - RMW operations that have a return value are fully ordered;

    Fully ordered primitives are ordered against everything prior and
    everything subsequent. Therefore a fully ordered primitive is like
    having an smp_mb() before and an smp_mb() after the primitive.

Given this let's replace the extra memory barriers with comments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Anand Jain
16db5758fe btrfs: remove assert in btrfs_init_dev_replace_tgtdev()
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
David Sterba
e67c718b5b btrfs: add more __cold annotations
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
David Sterba
ffc5a3794f btrfs: add (the only possible) __exit annotation
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.

As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Anand Jain
ccb0e7d1c1 btrfs: verify subvolid mount parameter
We aren't verifying the parameter passed to the subvolid mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o subvolid=abc.
This patch verifies the subvolid option with match_u64.

Up to now the memparse function accepts the K/M/G/ suffixes, that are
usually meant for size values and do not make sense for a subvolume it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Liu Bo
5811375325 Btrfs: fix unexpected cow in run_delalloc_nocow
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.

In case of free space inode, this ends up with a warning in
cow_file_range().

The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.

With this, run_delalloc_nocow() bails out when errors occur at the two
places.

cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Nikolay Borisov
9678c54388 btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.

Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:

1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.

I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.

Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.

The modinfo confirms that now all the module dependencies are there:

before:
depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

after:
depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Qu Wenruo
3e72ee8874 btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
   Double underline prefix is normally for internal functions, while the
   function is used by both extent-tree and volumes.

   Although the name is a little longer, but it should explain its usage
   quite well.

2) Move it to volumes.h and make it static inline
   Just several if-else branches, really no need to define it as a normal
   function.

   This also makes later code re-use between kernel and btrfs-progs
   easier.

3) Remove function get_block_group_index()
   Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Qu Wenruo
2f659546c9 btrfs: tree-checker: Replace root parameter with fs_info
When inspecting the error message with real corruption, the "root=%llu"
always shows "1" (root tree), instead of the correct owner.

The problem is that we are getting @root from page->mapping->host, which
points the same btree inode, so we will always get the same root.

This makes the root owner output meaningless, and harder to port
tree-checker to btrfs-progs.

So get rid of the false and meaningless @root parameter and replace it
with @fs_info.
To get the owner, we can only rely on btrfs_header_owner() now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Liu Bo
393da91819 Btrfs: add tracepoint for em's EEXIST case
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Nikolay Borisov
5d23515be6 btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable
Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.

This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Gu JinXiang
7ce311d552 btrfs: use reada direction enum instead of constant value in load_free_space_tree
load_free_space_tree calls either function load_free_space_bitmaps or
load_free_space_extents. And either of those two will lead to call
btrfs_next_item.  So in function load_free_space_tree, use READA_FORWARD
to read forward ahead.

This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Gu Jinxiang
019599ada7 btrfs: use reada direction enum instead of constant value in populate_free_space_tree
populate_free_space_tree calls function btrfs_search_slot_for_read with
parameter int find_higher = 1, it means that, if no exact match is
found, then use the next higher item.  So in function
populate_free_space_tree, use READA_FORWARD to read forward ahead.

This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Nikolay Borisov
c1c3fac2a9 btrfs: Remove btrfs_inode::delayed_iput_count
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Qu Wenruo
793ff2c88c btrfs: volumes: Cleanup stripe size calculation
Cleanup the following things:
1) open coded SZ_16M round up
2) use min() to replace open-coded size comparison
3) code style

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Nikolay Borisov
da07d4ab33 btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations
The behavior of btrfs_delalloc_reserve_metadata depends on whether
the inode we are allocating for is the freespace inode or not. As it
stands if we are the free node we set 'flush' and 'delalloc_lock'
variable to certain values. Subsequently we check the values of those
vars and act accordingly. Instead, simplify things by having 1 if
which checks whether we are the freespace inode or not and do any
specific operation in either branches of that if. This makes the code
a bit easier to understand, as an added bonus it also shrinks the
compiled size:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17)
Function                                     old     new   delta
btrfs_delalloc_reserve_metadata             1876    1859     -17
Total: Before=85966, After=85949, chg -0.02%

No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Anand Jain
b1b8e38622 btrfs: insert newly opened device to the end of the list
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Anand Jain
f8e10cd3f8 btrfs: keep device list sorted
By maintaining the device list sorted lets us reproduce the problems
related to missing chunk in the degraded mode much more consistent. So
fix this by sorting the devices by devid within the kernel. So that we
know which device is assigned to the struct fs_info::latest_bdev when
all the devices are having and same SB generation.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Liu Bo
3d5addafd0 Btrfs: do not check inode's runtime flags under root->orphan_lock
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Nikolay Borisov
bc5511d0ed btrfs: Use schedule_timeout_interruptible
Instead of manually fiddling with the state of the task
(RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible
which adjusts the task state as needed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Nikolay Borisov
f9cacae314 btrfs: Move error handling of btrfs_start_dirty_block_groups closer to call site
Even though btrfs_start_dirty_block_groups is fairly in the beginning of
btrfs_commit_transaction outside of the critical section defined by the
transaction states it can only be run by a single comitter. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section which is a bit
counter-intuitive. So move the error handling righ after the function is
executed and let the sole runner of dirty block groups handle the return
value. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Anand Jain
7ef2d6a722 btrfs: not a disk error if the bio_add_page fails
bio_add_page() can fail for logical reasons as from the bio_add_page()
comments:

/*
 * This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or
 * it's a cloned bio.
 */

Here we have just allocated the bio, so both of those failures can't
occur. So drop the check. We can also drop the error stats for write
error.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Qu Wenruo
4117f207d4 btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option
Enospc_debug makes extent allocator print more debug messages,
however for chunk allocation, there is no debug message for enospc_debug
at all.

This patch will add message for the following parts of chunk allocator:

1) No rw device at all
   Quite rare, but at least output one message for this case.

2) Not enough space for some device
   This debug message is quite handy for unbalanced disks with stripe
   based profiles (RAID0/10/5/6).

3) Not enough free devices
   This debug message should tell us if current chunk allocator is
   working correctly under minimal device requirements.

Although in most cases, we will hit other ENOSPC before we even hit a
chunk allocator ENOSPC, but such debug info won't help.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
566b1760b4 btrfs: use ASSERT to report logical error in cow_file_range()
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.

The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
3752d22fce btrfs: cow_file_range() num_bytes and disk_num_bytes are same
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
2afb9653bf btrfs: remove unused function btrfs_async_submit_limit()
Commit [1] removed the need to use btrfs_async_submit_limit(), so
delete it.

[1]
 commit 736cd52e0c
  Btrfs: remove nr_async_submits and async_submit_draining

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Yang Shi
86d750a48a btrfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Nikolay Borisov
9a3daff320 btrfs: Add enospc_debug printing in metadata_reserve_bytes
Currently when enospc_debug mount option is turned on we do not print
any debug info in case metadata reservation failures happen. Fix this
by adding the necessary hook in reserve_metadata_bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Nikolay Borisov
45ae2c1841 btrfs: Document consistency of transaction->io_bgs list
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Nikolay Borisov
bf6d7d4900 btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs
list_first_entry is essentially a wrapper over cotnainer_of. The latter
can never return null even if it's working on inconsistent list since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock which ensures consistency of the list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
8f2ceaa7b4 btrfs: log, when replace, is canceled by the user
For debugging or administration purposes, we would want to know if and
when the user cancels the replace, to complement the existing messages
when dev-replace starts or finishes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, fold fix for RCU warning from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
acf18c56fd btrfs: fix null pointer deref when target device is missing
The replace target device can be missing when mounted with -o degraded,
but we wont allocate a missing btrfs_device to it. So check the device
before accessing.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25

This patch has been moved in front of patch "btrfs: log, when replace,
is canceled by the user" that could reproduce the crash if the system
reboots inside btrfs_dev_replace_start before the
btrfs_dev_replace_finishing call.

 $ mkfs /dev/sda
 $ mount /dev/sda mnt
 $ btrfs replace start /dev/sda /dev/sdb
 <insert reboot>
 $ mount po degraded /dev/sdb mnt
 <crash>

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ added reproducer description from mail ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
eceff22a80 btrfs: add a comment to mark the deprecated mount option
The options alloc_start and subvolrootid are deprecated, comment them in
the tokens list. And leave them as it is. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
d374060864 btrfs: manage commit mount option as %u
As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
02453bdeb0 btrfs: manage check_int_print_mask mount option as %u
As check_int_print_mask mount option is unsigned so manage it as %u for
token verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
764cb8b43d btrfs: manage metadata_ratio mount option as %u
As metadata_ratio mount option is unsinged so manage it as %u for token
verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
f7b885befd btrfs: manage thread_pool mount option as %u
The mount option thread_pool is always unsigned. Manage it that way all
around.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
ba020491c8 btrfs: extent_buffer_uptodate() make it static and inline
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.

Before:
   text	   data	    bss	    dec	    hex	filename
1131257	  82898	  18992	1233147	 12d0fb	fs/btrfs/btrfs.ko

After:
   text	   data	    bss	    dec	    hex	filename
1131090	  82898	  18992	1232980	 12d054	fs/btrfs/btrfs.ko

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Nikolay Borisov
70458a5819 btrfs: Remove fs_info argument of btrfs_write_and_wait_transaction
We already pass btrfs_trans_handle which contains a reference to the
fs_info so use that. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Nikolay Borisov
e9b919b1f7 btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used
We already pass the btrfs_transaction which references fs_info so no
need to pass the later as an argument. Also use the opportunity to
shorten transaction->trans. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
08d50ca32c btrfs: Remove fs_info argument from create_pending_snapshots/create_pending_snapshot
We already pass the trans handle which has a reference to fs_info to
create_pending_snapshot so we can refer to it directly. Doing this
obviates the need to pass the fs_info to create_pending_snapshots as
well. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
16916a88d4 btrfs: Remove fs_info argument from switch_commit_roots
We already have the fs_info from the passed transaction so use it
directly. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
97cb39bb91 btrfs: Remove root argument of cleanup_transaction
The only thing the passed root is used for is:
1. get a reference to the fs_info and to
2. call trace_btrfs_transaction_commit.

We can achieve 1) by simply referring to the fs_info from passed trans
object. As far as 2) is concerned cleanup_transaction is called from
only one place and the 'root' argument passed is the one from the trans
handle. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
9386d8bc58 btrfs: Don't pass fs_info to commit_cowonly_roots
We already pass a transaction handle which refrences the fs_info so
we can grab it from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
7e4443d9eb btrfs: Don't pass fs_info to commit_fs_roots
We already pass the transaction handle which has a reference to the
fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
e5c304e651 btrfs: Don't pass fs_info to btrfs_run_delayed_items/_nr
We already pass the transaction which has a reference to the fs_info,
so use that. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
b84acab38f btrfs: Don't pass fs_info to __btrfs_run_delayed_items
We already pass the transaction handle, which contains a refrence to
the fs_info so grab it from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
2121705420 btrfs: Don't pass fs_info arg to btrfs_start_dirty_block_groups
It can be referenced from the passed transaction so no point in passing
it as a function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
6c686b359a btrfs: Remove fs_info argument from btrfs_create_pending_block_groups
It can be referenced from the passed transaciton so no point in
passing it as function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
dc60c525cf btrfs: Remove fs_info argument from btrfs_trans_release_metadata
All current callers of this function just get a reference to the
trans->fs_info member and pass it as the second argument. Collapse this
into the function itself. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
c9b577c01a btrfs: Open code btrfs_write_and_wait_marked_extents
btrfs_write_and_wait_transaction is essentially a wrapper of
btrfs_write_and_wait_marked_extents with the addition of calling
clear_btree_io_tree. Having the code split doesn't really bring any
benefit. Open code the later into the former and add proper
documentation header.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Nikolay Borisov
0e34693f7b btrfs: Make btrfs_trans_release_metadata private to transaction.c
This function is only ever used in __btrfs_end_transaction and
btrfs_commit_transaction so there is no need to export it via header.
Let's move it closer to where it's used, make it static and remove it
from the header. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
15fc1283f6 btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume()
btrfs_init_dev_replace_tgtdev_for_resume() initializes replace
target device in a few simple steps, so do it at the parent function.
Moreover, there isn't any other caller so just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
18e67c73dc btrfs: btrfs_dev_replace_cancel() can return int
Current u64 return from btrfs_dev_replace_cancel() was probably done
to match the btrfs_ioctl_dev_replace_args::result. However as our
actual return value fits in int, and it further gets typecast to u64,
so just return int.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
17d202b973 btrfs: rename __btrfs_dev_replace_cancel()
Remove __ which is for the special functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
97282031a6 btrfs: open code btrfs_dev_replace_cancel()
btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the
actual cancel so just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Nikolay Borisov
af89e0dc2c btrfs: Don't hardcode the csum size in btrfs_ordered_sum_size
Currently the function uses a hardcoded value for the checksum size of
a sector. This is fine, given that we currently support only a single
algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not
having other algorithms, btrfs' design supports using a different
algorithm whith different space requirements. To future-proof the code
query the size of the currently used algorithm from the in-memory copy
of the super block. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Colin Ian King
97dc231e89 Btrfs: extent map selftest: add missing void parameter to btrfs_test_extent_map
Add a missing void parameter to function btrfs_test_extent_map, fixes
sparse warning:

warning: non-ANSI function declaration of function 'btrfs_test_extent_map'

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Colin Ian King
7a61f88088 btrfs: remove redundant check on ret and goto
The check for a non-zero ret is redundant as the goto will jump to
the very next statement anyway.  Remove this extraneous code.

Detected by CoverityScan, CID#1463784 ("Identical code for different
branches")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Nikolay Borisov
7806c6eb15 btrfs: Remove unused btrfs_start_transaction_lflush function
Commit 0e8c36a9fd ("Btrfs: fix lots of orphan inodes when the space
is not enough") changed the way transaction reservation is made in
btrfs_evict_node and as a result this function became unused. This has
been the status quo for 5 years in which time no one noticed, so I'd
say it's safe to assume it's unlikely it will ever be used again.

Historical note: there were more attempts to remove the function, the
reasoning was missing and only based on some static analysis tool
reports. Other reason for rejection was that there seemed to be
connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be
removeed to. This was not correct so removing the function is all we can
do.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Howard McLauchlan
b6a535faed btrfs: print error if primary super block write fails
Presently, failing a primary super block write but succeeding in at
least one super block write in general will appear to users as if
nothing important went wrong. However, upon unmounting and re-mounting,
the file system will be in a rolled back state. This was discovered
with a BCC program that uses bpf_override_return() to fail super block
writes.

This patch outputs an error clarifying that the primary super block
write has failed, so users can expect potentially erroneous behaviour.
It also forces wait_dev_supers() to return an error to its caller if
the primary super block write fails.

Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Qu Wenruo
062d4d1f40 btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:28 +02:00
Liu Bo
af2679e4a7 Btrfs: enhance leak debug checker for extent state and extent buffer
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:28 +02:00
Linus Torvalds
f36b7534b8 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, thp: do not cause memcg oom for thp
  mm/vmscan: wake up flushers for legacy cgroups too
  Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
  mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
  mm/thp: do not wait for lock_page() in deferred_split_scan()
  mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
  x86/mm: implement free pmd/pte page interfaces
  mm/vmalloc: add interfaces to free unmapped page table
  h8300: remove extraneous __BIG_ENDIAN definition
  hugetlbfs: check for pgoff value overflow
  lockdep: fix fs_reclaim warning
  MAINTAINERS: update Mark Fasheh's e-mail
  mm/mempolicy.c: avoid use uninitialized preferred_node
2018-03-22 18:48:43 -07:00
Mike Kravetz
63489f8e82 hugetlbfs: check for pgoff value overflow
A vma with vm_pgoff large enough to overflow a loff_t type when
converted to a byte offset can be passed via the remap_file_pages system
call.  The hugetlbfs mmap routine uses the byte offset to calculate
reservations and file size.

A sequence such as:

  mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
  remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

will result in the following when task exits/file closed,

  kernel BUG at mm/hugetlb.c:749!
  Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The overflowed pgoff value causes hugetlbfs to try to set up a mapping
with a negative range (end < start) that leaves invalid state which
causes the BUG.

The previous overflow fix to this code was incomplete and did not take
the remap_file_pages system call into account.

[mike.kravetz@oracle.com: v3]
  Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
[akpm@linux-foundation.org: include mmdebug.h]
[akpm@linux-foundation.org: fix -ve left shift count on sh]
Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
Fixes: 045c7a3f53 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nic Losby <blurbdust@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 17:07:01 -07:00
Linus Torvalds
c4f4d2f917 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Always validate XFRM esn replay attribute, from Florian Westphal.

 2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long.

 3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul
    Triebitz.

 4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel
    Axtens.

 5) Fix some interrupt handling issues in e1000e driver, from Benjamin
    Poitier.

 6) Use strlcpy() in several ethtool get_strings methods, from Florian
    Fainelli.

 7) Fix rhlist dup insertion, from Paul Blakey.

 8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev.

 9) Fix driver unload crash when link is up in smsc911x, from Jeremy
    Linton.

10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric
    Dumazet.

11) Need to purge the write queue when TCP connections are aborted,
    otherwise userspace using MSG_ZEROCOPY can't close the fd. From
    Soheil Hassas Yeganeh.

12) Fix double free in error path of team driver, from Arkadi
    Sharshevsky.

13) Filter fixes for hv_netvsc driver, from Stephen Hemminger.

14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo
    Bianconi.

15) Properly filter out unsupported feature flags in macvlan driver,
    from Shannon Nelson.

16) Don't request loading the diag module for a protocol if the protocol
    itself is not even registered. From Xin Long.

17) If datagram connect fails in ipv6, make sure the socket state is
    consistent afterwards. From Paolo Abeni.

18) Use after free in qed driver, from Dan Carpenter.

19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the
    entry. From Sabrina Dubroca.

20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins.

21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita.

22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel.

23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti.

24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli.

25) Memory leak in gemini driver, from Igor Pylypiv.

26) IDR leaks in error paths of tcf_*_init() functions, from Davide
    Caratti.

27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun.

28) Missing dev_put() in error path of macsec_newlink(), from Dan
    Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits)
  macsec: missing dev_put() on error in macsec_newlink()
  net: dsa: Fix functional dsa-loop dependency on FIXED_PHY
  hv_netvsc: common detach logic
  hv_netvsc: change GPAD teardown order on older versions
  hv_netvsc: use RCU to fix concurrent rx and queue changes
  hv_netvsc: disable NAPI before channel close
  net/ipv6: Handle onlink flag with multipath routes
  ppp: avoid loop in xmit recursion detection code
  ipv6: sr: fix NULL pointer dereference when setting encap source address
  ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
  net: aquantia: driver version bump
  net: aquantia: Implement pci shutdown callback
  net: aquantia: Allow live mac address changes
  net: aquantia: Add tx clean budget and valid budget handling logic
  net: aquantia: Change inefficient wait loop on fw data reads
  net: aquantia: Fix a regression with reset on old firmware
  net: aquantia: Fix hardware reset when SPI may rarely hangup
  s390/qeth: on channel error, reject further cmd requests
  s390/qeth: lock read device while queueing next buffer
  s390/qeth: when thread completes, wake up all waiters
  ...
2018-03-22 14:10:29 -07:00
Linus Torvalds
645102eac1 Just one fix for an occasional panic from Jeff Layton.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJasXu4AAoJECebzXlCjuG+fNcP/3QYZwLdIYe0pZSIn/9V6sW1
 PwmxhTkBOEkeP4RTArI/4spZk3mrTqv7SC+ACYKbkD910uDC8cY3ElcwKxi82NL7
 9M4gNGnscKJtUye2LhoamZZLOLDIzr302pEv0IIh3v11sAu0rK1K0bqHSfI1ga/q
 r3S1sRtad+UktVH+jJ8MNE4NBfXC/xTqPz6Z+wd30CPmJKaA7Oqle7Wd3479zj0c
 A7l+EtleFo/bctHwY7yp+lVs1Yqr3ShS+qu+YAknBTnGco9JJFJsZM6x33DFMZC4
 EvFPO0gh4PuN/SkgT4zBvBlSXQjuqUWTU98ohqP7ugTtd0dXEIpR4DLLwWScA2KV
 z+YIJIKUh8Ryh8KI+J50Ed9WseVlmACHOBTJjLW7KnSYtey3+mgh5kp6Vcd9IxHc
 Gju7YgDK5FOV2JykaHQ/qCaSyL6ao1firapSrtZK4N27LpJ05FreK6qnLJtHmiwU
 pqVHyfrH4kfho1T0Nr/bw4bnugBKZVXBbBQLKp2twgJhYkUkP+vxV8QMX3h531hB
 jsQV9OIVPkdTjr5L0BmQMJ4GugM1Zr4mIunY6Mw0z5ax1JeBUelWeQccqL1YCSGL
 TZr+0N9yOuzQHX+e+mZ8vFYqfUTRF/Xfkn1+ZV3ZQU/yY8oBkDbQddaLoo1hjLUx
 2dpR3xIhzyDkP59jrK4M
 =oGAf
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fix from Bruce Fields:
 "Just one fix for an occasional panic from Jeff Layton"

* tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: remove blocked locks on client teardown
2018-03-20 16:10:26 -07:00
Grygorii Strashko
2399ac42e7 sysfs: symlink: export sysfs_create_link_nowarn()
The sysfs_create_link_nowarn() is going to be used in phylib framework in
subsequent patch which can be built as module. Hence, export
sysfs_create_link_nowarn() to avoid build errors.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Fixes: a399546049 ("net: phy: Relax error checking on sysfs_create_link()")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-19 21:14:26 -04:00
Jeff Layton
68ef3bc316 nfsd: remove blocked locks on client teardown
We had some reports of panics in nfsd4_lm_notify, and that showed a
nfs4_lockowner that had outlived its so_client.

Ensure that we walk any leftover lockowners after tearing down all of
the stateids, and remove any blocked locks that they hold.

With this change, we also don't need to walk the nbl_lru on nfsd_net
shutdown, as that will happen naturally when we tear down the clients.

Fixes: 76d348fadf (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks)
Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # 4.9
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:37:21 -04:00
Linus Torvalds
8f5fd927c3 for-4.16-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqrzY4ACgkQxWXV+ddt
 WDtV4BAAiiv5XECNB3lIvcoFjst8E7xIJNeKKzFZh8HNm+L94zP00uqrpHkwQSUv
 tk9TGSYgbJJZMhw6I1rwYSKsoTkvP6NVMQZjpkJktWto9NXObKPvJzjKMdB+GYQR
 TWGLBaHsMhruMTOewhdVsxA5od5+LravygHaLmTDQmho8jZSSEe0qqpqlakRI9sQ
 +VZv1PDEI68IDMKMjOT7de7Ssq9c99BpGNXxvi5fy5kijyfgrs/BUfHBNV5r66MS
 y0Yw1Ab+nlLC7cE5Ql24/snDojcLoSl7f5eYt7ib+DAmsJucYQzTfUetFcrYf8IC
 EAmFs5Br6pbopi5M1IOxGZABNbr0qMS+zLvoF52cyka32nrmK/IMWl1s8bW0xfva
 gxkjouzJDOTMq28asSdyu73i9yFyLB+tEHpojECljo+K5ASzvDad34xqVRlxY6vN
 UJm4d76OqTfAyTRo7sQRWfLz/Cb1W+/v0mPotPWHWVyF0bUNsBgX5YKJhma3pU/z
 rHwhI/fqg0NTXBcNUMCosEwS2u5vnjIfiOHkDTl1Dv747S/jM0AaNEvpjDo1cOuk
 lsl4xKoDkQ6RpBCeFxRCW0wk1Nl/Su6RQ217xTtND+aIa4c6G7A9X8QAjj2XNAuK
 CktJJD7wHYSBTu1sQ+u2R/c2QEgkAxABr4NpmWTPCBNlIKEjd4A=
 =bXFe
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "There's an important revert in this pull request that needs to go to
  stable as it causes a corruption on big endian machines.

  The other fix is for FIEMAP incorrectly reporting shared extents
  before a sync and one fix for a crash in raid56.

  So far we got only one report about the BE corruption, the stable
  kernels were out for like a week, so hopefully the scope of the damage
  is low"

* tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: use proper endianness accessors for super_copy"
  btrfs: add missing initialization in btrfs_check_shared
  btrfs: Fix NULL pointer exception in find_bio_stripe
2018-03-16 13:37:42 -07:00
David Sterba
093e037ca8 Revert "btrfs: use proper endianness accessors for super_copy"
This reverts commit 3c181c12c4.

The offending patch was merged in 4.16-rc4 and was promptly applied to
stable kernels 4.14.25 and 4.15.8.

The patch causes a corruption in several superblock items on big-endian
machines because of messed up endianity conversions. The damage is
manually repairable. A filesystem cannot be mounted again after it has
been unmounted once.

We do a full revert and not a fixup so stable can pick that patch ASAP.

Fixes: 3c181c12c4 ("btrfs: use proper endianness accessors for super_copy")
Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de
CC: stable@vger.kernel.org # 4.14+
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-16 14:49:44 +01:00
Linus Torvalds
df09348f78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - backport-friendly part of lock_parent() race fix

 - a fix for an assumption in the heurisic used by path_connected() that
   is not true on NFS

 - livelock fixes for d_alloc_parallel()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Teach path_connected to handle nfs filesystems with multiple roots.
  fs: dcache: Use READ_ONCE when accessing i_dir_seq
  fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
  lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
2018-03-15 18:57:14 -07:00
Eric W. Biederman
95dd77580c fs: Teach path_connected to handle nfs filesystems with multiple roots.
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem.  The subsets can be from
disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.

The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem.  The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.

When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount.  This move can happen either locally or
remotely.  With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check.  As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.

The is_subdir function might be cheap enough that we can call it
unconditionally.  Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to.  So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar.  But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc2 ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-15 18:48:38 -04:00
Edmund Nadolski
18bf591ba9 btrfs: add missing initialization in btrfs_check_shared
This patch addresses an issue that causes fiemap to falsely
report a shared extent.  The test case is as follows:

xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io  -c "fiemap -v" /media/scratch/file5

which gives the resulting output:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128 0x2001
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.

With proper re-initialization the test result is as follows:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

which corrects the regression.

Fixes: 3ec4d3238a ("btrfs: allow backref search checks for shared extents")
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
[ add text from cover letter to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-14 22:26:46 +01:00
Dmitriy Gorokh
047fdea634 btrfs: Fix NULL pointer exception in find_bio_stripe
On detaching of a disk which is a part of a RAID6 filesystem, the
following kernel OOPS may happen:

[63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
[63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
[63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
[63122.971202] Oops: 0000 [#1] SMP
[63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
[63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
[63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
[63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
[63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
[63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
[63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
[63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
[63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
[63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
[63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
[63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
[63123.009969] Call Trace:
[63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
[63123.010251] bio_endio+0xa1/0x120
[63123.010378] generic_make_request+0x218/0x270
[63123.010921] submit_bio+0x66/0x130
[63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
[63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
[63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
[63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
[63123.011759] ? ___cache_free+0x1c5/0x300
[63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
[63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
[63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
[63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
[63123.012656] process_one_work+0x14d/0x350
[63123.012888] worker_thread+0x4d/0x3a0
[63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
[63123.013192] kthread+0x109/0x140
[63123.013315] ? process_scheduled_works+0x40/0x40
[63123.013472] ? kthread_stop+0x110/0x110
[63123.013610] ret_from_fork+0x25/0x30
[63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
[63123.014678] CR2: 0000000000000080
[63123.016590] ---[ end trace a295ea7259c17880 ]—

This is reproducible in a cycle, where a series of writes is followed by
SCSI device delete command. The test may take up to few minutes.

Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
[ no signed-off-by provided ]
Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-14 22:26:35 +01:00
Tejun Heo
d0264c01e7 fs/aio: Use RCU accessors for kioctx_table->table[]
While converting ioctx index from a list to a table, db446a08c2
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors.  This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the approriate RCU
accessors when using the field.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c2 ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.12+
2018-03-14 12:10:17 -07:00
Tejun Heo
a6d7cff472 fs/aio: Add explicit RCU grace period when freeing kioctx
While fixing refcounting, e34ecee2ae ("aio: Fix a trinity splat")
incorrectly removed explicit RCU grace period before freeing kioctx.
The intention seems to be depending on the internal RCU grace periods
of percpu_ref; however, percpu_ref uses a different flavor of RCU,
sched-RCU.  This can lead to kioctx being freed while RCU read
protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.13+
2018-03-14 12:10:17 -07:00
Linus Torvalds
fc6eabbbf8 NFS client bugfixes for Linux 4.16
Hightlights include the following stable fixes:
 
 - NFS: Fix an incorrect type in struct nfs_direct_req
 - pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
 - NFS: Fix unstable write completion
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaprb9AAoJEGcL54qWCgDysK0P/Rwyc7fhcs4CvxETQT0LhbbB
 P6XDAb5pBeU+ca8ZWzqlpe1uKc0z5ykuTLuxLt2QvSb9+6h+8T1gxOa5SibehnwY
 6UizLDgHMuY2V4wqm4XLHiZX7J0Evdo4rgNCp7qq2/TinBPYKQsqIz34+s9BZDti
 6LG6WiVhOtkhs6s/sRbAn+FfMbSiQn54iQIgFPeO+3zEzaaLzUZqlVHO5yjxVnxh
 6oYGkEPVzP09URUO/HJ1VTUZJnso1/axEcH9qhuJzZ+pANCpuBjSMoFzzE6oI2mD
 dfD8UJZjwqLb7AsFhgSQUrDXzaI0Xg7ccrImtCp4QLlTjja3TlOYMYbK9kDRHC1/
 kT93cSFSIeSGpqLVSdNyiz7pCh2Cps2U0JTbhwE0R1NEskWQPzSwY73jEwiA/bex
 JOZBxtjVZt8TgOKHxPO65rvQwaq3T31KrQSDLqv3JReI0epJeqng0h+pwiZluNAs
 TFjE1aP0l567A3qQnPrHrGZ4IIs8XYobZ+MBxf9x5PaRWnGoPUkhHXMP2j8JCnwa
 ofPBR4Mx5WhnsB7Z5vsYetHtmD99QbJ0+WH9ObilUFt6gUgXoXCrvexIUq+LD3DP
 HIIqfbSorAIlQxtCwEz7m85B7VEKByaCWj8OiFt5VVAreEhiazmsncsyh8R5fa88
 wJDxRe3QmNu7BEJ9NJKi
 =IQGX
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include the following stable fixes:

   - NFS: Fix an incorrect type in struct nfs_direct_req

   - pNFS: Prevent the layout header refcount going to zero in
     pnfs_roc()

   - NFS: Fix unstable write completion"

* tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix unstable write completion
  pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
  NFS: Fix an incorrect type in struct nfs_direct_req
2018-03-12 10:47:03 -07:00
Linus Torvalds
719ea86151 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a corner case for NFS exporting (introduced in this cycle)
  as well as fixing miscellaneous bugs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update Kconfig texts
  ovl: redirect_dir=nofollow should not follow redirect for opaque lower
  ovl: fix ptr_ret.cocci warnings
  ovl: check ERR_PTR() return value from ovl_lookup_real()
  ovl: check lower ancestry on encode of lower dir file handle
  ovl: hash non-dir by lower inode for fsnotify
2018-03-09 09:46:14 -08:00
Linus Torvalds
2d9b1d69c3 Changes since last update:
- Fix some iomap locking problems
 - Don't allocate cow blocks when we're zeroing file data
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJamePeAAoJEPh/dxk0SrTryPgP/0sh8xcAFUHR0ad7D74nSp3o
 1gjBi6Yl+i+hn5SZ/gkKPa5YFjTMaefcTV8rpZm3uPYbTf75TTscjKDDczFNEqoh
 CDKOeyfOchVSjhRI5ExElBh5ToxDgh+HMzzlmxk+olfWA+BXjJj2J2iLh871v9Ym
 hQl7xOYUDPu1xZz+jLhDLKG2Gd/R97ez2KZM43gKMAUsQ7eUbz6CtV73cFT+tNfP
 jjIe3xp6ohbJYJalMThbNcI2mOOssnCM8BHUDBJib7N4CXucJYTMpDAcGvNoFKul
 +K2Icyip1/6CS4LkWDpP29ayiH+sTbZk8/ipioe27hXXHGqKG84HQOBefRZmo4o0
 CC64SoomuxQX6V7qL6hf0Sz82QQbl+ToSliZ97xXP5o84/OxhggZGqVdykTR2+DB
 4POaJzRPmYJKPO4Q0yT2worB4oepAUrridJnaQE0y9F80xnLTvJxpD53atLCHulj
 Ht9NlvnvqndDYaYjQwvucOmdR9vCos3OjZCaXnYgL+EpbgZUvMeQoxuyBUBTdQeQ
 CNGq3EDssI75gD67sWeyFc+gOST259XofANw3mvlu3KS2/huviORJkDYdtzi2/LL
 gkCN9OW9CmXOaAiylE8dSAu98SAR+c83wmahMeBA2mq5vYrQU4rIDGYZi/Mulne4
 tD7eW5nlvps9Fsvf58fd
 =T9nM
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix some iomap locking problems

 - Don't allocate cow blocks when we're zeroing file data

* tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't block on the ilock for RWF_NOWAIT
  xfs: don't start out with the exclusive ilock for direct I/O
  xfs: don't allocate COW blocks for zeroing holes or unwritten extents
2018-03-09 09:37:29 -08:00
Trond Myklebust
c4f24df942 NFS: Fix unstable write completion
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.

In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:32 -05:00
Trond Myklebust
9c6376ebdd pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
Ensure that we hold a reference to the layout header when processing
the pNFS return-on-close so that the refcount value does not inadvertently
go to zero.

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.10+
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
2018-03-08 12:56:31 -05:00
Trond Myklebust
d9ee65539d NFS: Fix an incorrect type in struct nfs_direct_req
The start offset needs to be of type loff_t.

Fixed: 5fadeb47dc ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:31 -05:00
Andreas Gruenbacher
3b5da96e45 gfs2: Fixes to "Implement iomap for block_map" (2)
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0.  This can
lead to arbitrary incorrect behavior including crashes or disk
corruption.  Revert the incorrect part of that commit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-07 11:40:38 -07:00
Miklos Szeredi
36cd95dfa1 ovl: update Kconfig texts
Add some hints about overlayfs kernel config options.

Enabling NFS export by default is especially recommended against, as it
incurs a performance penalty even if the filesystem is not actually
exported.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-03-07 11:47:15 +01:00
Linus Torvalds
af8c081627 for-4.16-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqbuK8ACgkQxWXV+ddt
 WDsiVQ//eE8Axfw4qWOyHHjozoAIu9kifvOFQIhvKviLiXHJNrD/vBI6YwD1hyD1
 rbbLilMsEl1OD1Sq3AeOUMNSSl5qUFEB+CA8vg9GznFTNRobTkL+p96Zt5xlRDu3
 lsFFV93tED+dK4D/eSGP+xYbknA8hIk/2gWPkd6hpYKyh2QdsPBYDqCnaEXvd79P
 DIP/cAjIfzqQn0FTiZ9wbaES+LPO+NoZgQRC2w9McYQ5CEMc+oAChEmPJRwpPoKy
 YdhuwoULniRNHVnVOIVfi4w9jkSPSz7YIgLeRFli/WGBYGcKeHTMFkMa12KdpuUC
 JUclOogJ5ZMbFV3C0W8XEJ7Vb9ltIevrH8MgfKP/3BScuZzbJZ+n5KkH2Gf7vcpe
 w5cZGOsKDz+35fDCdTmsFpDK9kpGzHq48JlRifOjARbdyqNwVq4emxOeQlO1ygzq
 Y9H5UeMpp/FDAm6g/bV8ezCXzwuwUDV9CwAJBD+WCiZhD2nX85FfIp1kfF2zLcUg
 Y8irqV6A/J/0BFkF7Iu9AuzTxOdo6zMLkcHEKb+sL7yGxZ2248v5gZxDoyV5iNWR
 WY4Y2GaMemZGu6+NyyLFAJzlFyACD5fSy8vT7KD6upCzwmlAtaJ3VULfTrLyi/uM
 SNZAKf3WzKBVe5h52GNGRCi5OLTteH6Yqz5m5/h8r74mBO00IrA=
 =bF96
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - when NR_CPUS is large, a SRCU structure can significantly inflate
   size of the main filesystem structure that would not be possible to
   allocate by kmalloc, so the kvalloc fallback is used

 - improved error handling

 - fix endiannes when printing some filesystem attributes via sysfs,
   this is could happen when a filesystem is moved between different
   endianity hosts

 - send fixes: the NO_HOLE mode should not send a write operation for a
   file hole

 - fix log replay for for special files followed by file hardlinks

 - fix log replay failure after unlink and link combination

 - fix max chunk size calculation for DUP allocation

* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log replay failure after unlink and link combination
  Btrfs: fix log replay failure after linking special file and fsync
  Btrfs: send, fix issuing write op when processing hole in no data mode
  btrfs: use proper endianness accessors for super_copy
  btrfs: alloc_chunk: fix DUP stripe size handling
  btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
  btrfs: handle failure of add_pending_csums
  btrfs: use kvzalloc to allocate btrfs_fs_info
2018-03-04 11:04:27 -08:00
Linus Torvalds
2833419a62 A cap handling fix from Zhi that ensures that metadata writeback isn't
delayed and three error path memory leak fixups from Chengguang.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJamV3qAAoJEEp/3jgCEfOLv6kIAI0RXShR20dz9M3GnL2bDws7
 KFkC/aNwbviXMAih3q7Q7ZiwXv6BI/uUK0YIYG/tWZoNHucYGS8ZOFDikdiIMcJR
 2VP8Yas1/NvnLlh0ut45s2imvbXinkHcjWOLqgJJmJ2cEP9Ar/mbwz//TgfVMdNm
 TSHEm2LPRge9jK3Ab/ZiatExGH6dx5hRJvssSauF+f2iN+q4nKj2aBqle+FKaNWT
 Iv0ONi2f7aXnaXRvum4Y1gnKWXzKSFd8Y2Cr5R3tfTrKBv8nmQz54eRsMnnuxK/7
 uV6Ujch4LUXBI1etXLqTwRdZQIcw3Li5PYxGrO8fl8mQfJau9yv5REw/rbgB5FU=
 =f8gA
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A cap handling fix from Zhi that ensures that metadata writeback isn't
  delayed and three error path memory leak fixups from Chengguang"

* tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix potential memory leak in init_caches()
  ceph: fix dentry leak when failing to init debugfs
  libceph, ceph: avoid memory leak when specifying same option several times
  ceph: flush dirty caps of unlinked inode ASAP
2018-03-02 10:05:10 -08:00