EIO is only for the IO failure to the device, avoid it. Use ENOENT as
that's the closest error code describing what happened.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can
be used to check for the actual error that occurred in the function.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ minor error message adjustment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove variables 'start' and 'end', which are set but never used.
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We now get a harmless compile-time on 32-bit architectures:
fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]
This changes the format string to use %zu instead of %lu for size_t.
Fixes: c1f6520bf360 ("btrfs: tree-checker: Enhance output for check_extent_data_item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.
Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.
However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By setting compression for a defrag task, the task will start IO at
the end of defrag.
After the combo of filemap_flush(), we've already made sure that
dirty pages have made progress via async compress thread because the
second filemap_flush() will wait for page lock, which won't be
unlocked until those pages have been marked as writeback and ordered
extents have been queued.
And this is for per-inode defrag, it's not helpful to wait on a global
%async_delalloc_pages and %nr_async_submits from fs_info.
Although waiting on %nr_async_submits means that all bios are
submitted down to per-device schedule IO lists, it doesn't wait for
their completions, thus users still need to do fsync/sync to make sure
the data is on disk. While with this change, it makes sure that pages
are marked with writeback bits and will be submitted asynchronously
shortly, therefore, the behavior of defrag option '-c' remains unchanged.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was intended to congest higher layers to not send bios, but as
1) the congested bit has been taken by writeback
Async bios come from buffered writes and DIO writes.
For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.
2) and no one is waiting for %nr_async_bios down to zero,
Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus. And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.
We can safely remove them now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
Output the invalid member name and its bad value, along with its
expected value range or alignment.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Output the bad value and expected good value (or its alignment).
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ unindent long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance the output to print:
1) the eason
2) the ad value, if reason is not sufficient
3) good value (range)
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording, unidented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)
And add more info about the error, and replace EIO with EUCLEAN.
For nr_items error, report if it's too large or too small, and output
the valid value range.
For node block pointer, added a new alignment checker.
For key order, also output the next key to make the problem more
obvious.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove dead assigment of num_bytes.
Also as num_bytes only used in the will_compress block as copy of
total_in just replace that with total_in and drop num_bytes entirely.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a53f4f8e9c ("btrfs: Don't call btrfs_start_transaction() on
frozen fs to avoid deadlock.") started using internal calls and we
replace them with more suitable ones.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct names for sysfs are generated only based on the
attribute names. This means that attribute names cannot be reused in
multiple places throughout the complete btrfs sysfs hierarchy.
E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
result in the same struct name btrfs_attr_total_bytes. A workaround for
this case was made in the past by ad hoc creating an extra macro
wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
name.
Instead of polluting sysfs.h with such kind of extra macro definitions,
and only doing so when there are collisions, use a prefix which gets
inserted in the struct name, so we keep everything nicely grouped
together by default.
Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_transaction_commit fails it will proceed to call
cleanup_transaction, which in turn already does btrfs_abort_transaction.
So let's remove the unnecessary code duplication. Also let's be explicit
about handling failure of btrfs_uuid_tree_add by calling
btrfs_end_transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_udpate_root can fail and it aborts the transaction, the correct
way to handle an aborted transaction is to explicitly end with
btrfs_end_transaction. Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.
Furthermore btrfs_commit_transaction can also fail and by ignoring it's
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_new_device() calls btrfs_attach_transaction() to
commit sys chunks, and it should error out if it fails.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of BUG_ON return error to the caller. And handle the fail
condition by calling the abort transaction and going through the
error path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When new device is being added to seed FS, seed FS is marked writable,
but when we fail to bring in the new device, we missed to undo the
writable part. This patch fixes it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
Must be aligned to sectorsize
3) Item size alignedment
Must be aligned to csum size
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra checks for item with EXTENT_DATA type. This checks the
following thing:
0) Key offset
All key offsets must be aligned to sectorsize.
Inline extent must have 0 for key offset.
1) Item size
Uncompressed inline file extent size must match item size.
(Compressed inline file extent has no information about its on-disk size.)
Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.
Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current check_leaf() function does a good job checking key order and
item offset/size.
However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.
So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.
And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.
This makes later item/key offset checks and item size checks easier to
be implemented.
Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Was added in:
c8b978188c
"Btrfs: Add zlib compression support"
Survive to near time (from 08.10.2008).
Because 'start' checked for zero before branch, so it's safe to remove
that subtraction.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs. This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction. From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.
From Nikolai, IOW: when we go to the final stage of deciding whether to
do trans commit, instead of passing all the reservations from all
tickets we just pass the reservation for the current ticket. Otherwise,
in case all reservations exceed pinned space, then we don't commit
transaction and fail prematurely. Before we passed num_bytes from
flush_space, where num_bytes was the sum of all pending reserverations,
but now all we do is take the first ticket and commit the trans if we
can satisfy that.
Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
[ added Nikolai's comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
By analyzing the perf on btrfs send, we found it take large amount of
cpu time on page_cache_sync_readahead. This effort can be reduced after
switching to asynchronous one. Overall performance gain on HDD and SSD
were 9 and 15 percent if simply send a large file.
Signed-off-by: Kuanling Huang <peterh@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The local bio_list may have pending bios when doing cleanup, it can
end up with memory leak if they don't get freed.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't populate the read-only array types on the stack, instead make
it static const. Makes the object code smaller by nearly 60 bytes:
Before:
text data bss dec hex filename
90536 6552 64 97152 17b80 fs/btrfs/ioctl.o
After:
text data bss dec hex filename
90414 6616 64 97094 17b46 fs/btrfs/ioctl.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc()
too.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We've seen the following backtrace stack in ftrace or dmesg log,
kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace>
=> finish_ordered_fn (ffffffffa0384475)
=> btrfs_scrubparity_helper (ffffffffa03ca577) <-----"incorrect"
=> btrfs_freespace_write_helper (ffffffffa03ca98e) <-----"correct"
=> process_one_work (ffffffff81117b2f)
=> worker_thread (ffffffff81118c2a)
=> kthread (ffffffff81121de0)
=> ret_from_fork (ffffffff81d7087a)
btrfs_freespace_write_helper is actually calling normal_worker_helper
instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
incorrect function address while unwinding the stack,
btrfs_scrubparity_helper really shouldn't be shown up.
It's caused by compiler doing inline for our helper function, adding a
noinline tag can fix that.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have started plug in btrfs_write_and_wait_marked_extents() but the
generated IOs actually go to device's schedule IO list where the work
is doing in another task, thus the started plug doesn't make any
sense.
And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing log tree, doing sync submit can merge more
IOs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are checks on fs_info in __btrfs_panic to avoid dereferencing a
null fs_info, however, there is a call to btrfs_crit that may also
dereference a null fs_info. Fix this by adding a check to see if fs_info
is null and only print the s_id if fs_info is non-null.
Detected by CoverityScan CID#401973 ("Dereference after null check")
Fixes: efe120a067 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of variable 'can_recover' is never used after being set, thus
it should be removed, as it was never used since the first commit
68a7342c51 ("Btrfs: cleanup orphaned root orphan item").
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__link_block_group is called from only 2 places and at each call site the
space_info being passed is the same as the space info assigned to the passed
cache struct. Let's remove the redundant argument and make the function
reference the space_info from the passed block_group_cache. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ renamed to link_block_group ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the code executes add_extent_mapping and if it is successful
it links the new mapping, it then proceeds to unlock the extent mapping
tree and check for failure and handle them. Instead, rework the code to
only perform a single check if add_extent_mapping has failed and handle
it, otherwise the code continues in a linear fashion. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really uses the first
2 parameters, i.e. the roots. Since there are not other functions
implementing the btrfs_changed_cb_t let's remove the unused parameters
from the prototype and implementation.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iterate_dir_item:found_key - introduced in 31db9f7c23 ("Btrfs:
introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used.
record_ref:num - ditto
This is a first pass with the low-hanging fruit. There are still quite a
few unsued parameters in some function which have to abide by a callback
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Src was initially part of 31ff1cd25d ("Btrfs: Copy into the log tree in
big batches"), however 16e7549f04 ("Btrfs: incompatible format change
to remove hole extents") changed parameters passed to copy_items which
made the src variable redundant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because
a) we don't have to calculate checksum across processors,
b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
that IOs get much less merges in general.
Lets use sync submit for nodatasum inodes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After mapping block with BTRFS_MAP_WRITE, parities have been sorted to
the end position, so this search can start from the first parity
stripe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied changelog as a comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.
kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.
Fix this by publishing the fsid through struct super_block.s_uuid.
[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
needs the file handle encoding support from the lower filesystem, which
is supported. Filling the whole filesystem id is correct, the subvolume
id is encoded in the file handle buffer from inside btrfs_encode_fh. ]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These aren't used outside of volumes.c.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some static functions are needlessly forward declared. Let's remove those
declarations since they add no value.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both wait_for_commit() and wait_for_writer() are checking the
condition out of the mutex lock.
This refactors code a bit to be lock safe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so
we don't gain benefits by batching 64 bios here.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQGcBAABAgAGBQJZ81hKAAoJEIosvXAHck9R7K4L/R4vPpYn19s/xPUf0fUYMOWO
JOIghfeCmmfCd2kTZF+fcDRNBpGnJjjs4ZPloxIbF7bQF0VbjrkToxthF6f9aYIJ
gt0jH1ntGUvraDpkZelTAGRj1BZou2IBzJF3Or1sL83ZX76fyXm9cJUx8Y+l2Mlx
BJMOL0Au38oRKOGnGk3GPtrflgNxe+6cTpNhLmVa9CBNDMQYjobrALgGPpbGf5h3
6l1i0IxMXuxeXjqFva0GKCjTsQSON44gNNHQoggIfHvE3nBVpSZLCwNwrVHOfd8q
4FlEXPzr3ME4WzASWqw1kAX+aij2NqbaLgDs7USkn4mUheIvZcHhC0LPVZJrZ1b4
2c3RHkOV0aZQunPJyq5vtO9B4TJC6MLcHS46iwQ6lao9hTVT8OqV7R40qmzQyt06
KwdIGObEm76J7u4lgVlAsapVKgPLOPuObQKhYUdNvRarorNaxgtadnHOhXM20G3S
PE23XvNJYwKl6SJbP97ih8Uq0//7vNCxd/khdnq4FQ==
=8LHe
-----END PGP SIGNATURE-----
Merge tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Various SMB3 fixes for 4.14 and stable"
* tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Validate negotiate request must always be signed
SMB: fix validate negotiate info uninitialised memory use
SMB: fix leak of validate negotiate info response buffer
CIFS: Fix NULL pointer deref on SMB2_tcon() failure
CIFS: do not send invalid input buffer on QUERY_INFO requests
cifs: Select all required crypto modules
CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
cifs: handle large EA requests more gracefully in smb2+
Fix encryption labels and lengths for SMB3.1.1
Pull overlayfs fixes from Miklos Szeredi:
"Fix several issues, most of them introduced in the last release"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: do not cleanup unsupported index entries
ovl: handle ENOENT on index lookup
ovl: fix EIO from lookup of non-indexed upper
ovl: Return -ENOMEM if an allocation fails ovl_lookup()
ovl: add NULL check in ovl_alloc_inode
Pull fuse fix from Miklos Szeredi:
"This fixes a longstanding bug, which can be triggered by interrupting
a directory reading syscall"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix READDIRPLUS skipping an entry
According to MS-SMB2 3.2.55 validate_negotiate request must
always be signed. Some Windows can fail the request if you send it unsigned
See kernel bugzilla bug 197311
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
An undersize validate negotiate info server response causes the client
to use uninitialised memory for struct validate_negotiate_info_rsp
comparisons of Dialect, SecurityMode and/or Capabilities members.
Link: https://bugzilla.samba.org/show_bug.cgi?id=13092
Fixes: 7db0a6efdc ("SMB3: Work around mount failure when using SMB3 dialect to Macs")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Fixes: ff1c038add ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
If SendReceive2() fails rsp is set to NULL but is dereferenced in the
error handling code.
Cc: stable@vger.kernel.org
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
query_info() doesn't use the InputBuffer field of the QUERY_INFO
request, therefore according to [MS-SMB2] it must:
a) set the InputBufferOffset to 0
b) send a zero-length InputBuffer
Doing a) is trivial but b) is a bit more tricky.
The packet is allocated according to it's StructureSize, which takes
into account an extra 1 byte buffer which we don't need
here. StructureSize fields must have constant values no matter the
actual length of the whole packet so we can't just edit that constant.
Both the NetBIOS-over-TCP message length ("rfc1002 length") L and the
iovec length L' have to be updated. Since L' is computed from L we
just update L by decrementing it by one.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Some dependencies were lost when CIFS_SMB2 was merged into CIFS.
Fixes: 2a38e12053 ("[SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred")
Signed-off-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Marios Titas running a Haskell program noticed a problem with fuse's
readdirplus: when it is interrupted by a signal, it skips one directory
entry.
The reason is that fuse erronously updates ctx->pos after a failed
dir_emit().
The issue originates from the patch adding readdirplus support.
Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
Tested-by: Marios Titas <redneb@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b05b18381 ("fuse: implement NFS-like readdirplus support")
Cc: <stable@vger.kernel.org> # v3.9
sparse warns:
fs/ceph/caps.c:2042:9: warning: context imbalance in 'try_flush_caps' - wrong count at exit
We need to exit this function with the lock unlocked, but a couple of
cases leave it locked.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With index=on, ovl_indexdir_cleanup() tries to cleanup invalid index
entries (e.g. bad index name). This behavior could result in cleaning of
entries created by newer kernels and is therefore undesirable.
Instead, abort mount if such entries are encountered. We still cleanup
'stale' entries and 'orphan' entries, both those cases can be a result
of offline changes to lower and upper dirs.
When encoutering an index entry of type directory or whiteout, kernel
was supposed to fallback to read-only mount, but the fill_super()
operation returns EROFS in this case instead of returning success with
read-only mount flag, so mount fails when encoutering directory or
whiteout index entries. Bless this behavior by returning -EINVAL on
directory and whiteout index entries as we do for all unsupported index
entries.
Fixes: 61b674710c ("ovl: do not cleanup directory and whiteout index..")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Treat ENOENT from index entry lookup the same way as treating a returned
negative dentry. Apparently, either could be returned if file is not
found, depending on the underlying file system.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Commit fbaf94ee3c ("ovl: don't set origin on broken lower hardlink")
attempt to avoid the condition of non-indexed upper inode with lower
hardlink as origin. If this condition is found, lookup returns EIO.
The protection of commit mentioned above does not cover the case of lower
that is not a hardlink when it is copied up (with either index=off/on)
and then lower is hardlinked while overlay is offline.
Changes to lower layer while overlayfs is offline should not result in
unexpected behavior, so a permanent EIO error after creating a link in
lower layer should not be considered as correct behavior.
This fix replaces EIO error with success in cases where upper has origin
but no index is found, or index is found that does not match upper
inode. In those cases, lookup will not fail and the returned overlay inode
will be hashed by upper inode instead of by lower origin inode.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme. So change our read/write methods to just do the
trylock for the RWF_NOWAIT case. This fixes a ~25% regression in
AIM7.
Fixes: 91f9943e ("fs: support RWF_NOWAIT for buffered reads")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull vfs fixes from Al Viro:
"MS_I_VERSION fixes - Mimi's fix + missing bits picked from Matthew
(his patch contained a duplicate of the fs/namespace.c fix as well,
but by that point the original fix had already been applied)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Convert fs/*/* to SB_I_VERSION
vfs: fix mounting a filesystem with i_version
Pull key handling fixes from James Morris:
"This includes a fix for the capabilities code from Colin King, and a
set of further fixes for the keys subsystem. From David:
- Fix a bunch of places where kernel drivers may access revoked
user-type keys and don't do it correctly.
- Fix some ecryptfs bits.
- Fix big_key to require CONFIG_CRYPTO.
- Fix a couple of bugs in the asymmetric key type.
- Fix a race between updating and finding negative keys.
- Prevent add_key() from updating uninstantiated keys.
- Make loading of key flags and expiry time atomic when not holding
locks"
* 'fixes-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
commoncap: move assignment of fs_ns to avoid null pointer dereference
pkcs7: Prevent NULL pointer dereference, since sinfo is not always set.
KEYS: load key flags and expiry time atomically in proc_keys_show()
KEYS: Load key expiry time atomically in keyring_search_iterator()
KEYS: load key flags and expiry time atomically in key_validate()
KEYS: don't let add_key() update an uninstantiated key
KEYS: Fix race between updating and finding a negative key
KEYS: checking the input id parameters before finding asymmetric key
KEYS: Fix the wrong index when checking the existence of second id
security/keys: BIG_KEY requires CONFIG_CRYPTO
ecryptfs: fix dereference of NULL user_key_payload
fscrypt: fix dereference of NULL user_key_payload
lib/digsig: fix dereference of NULL user_key_payload
FS-Cache: fix dereference of NULL user_key_payload
KEYS: encrypted: fix dereference of NULL user_key_payload
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
This new command allows processes to register their intent to use the
private expedited command. This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error code is missing here so it means we return ERR_PTR(0) or NULL.
The other error paths all return an error code so this probably should
as well.
Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This was detected by fault injection test
Signed-off-by: Hirofumi Nakagawa <nklabs@gmail.com>
Fixes: 13cf199d00 ("ovl: allocate an ovl_inode struct")
Cc: <stable@vger.kernel.org> # v4.13
[AV: in addition to the fix in previous commit]
Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- fix some more CONFIG_XFS_RT related build problems
- fix data loss when writeback at eof races eofblocks gc and loses
- invalidate page cache after fs finishes a dio write
- remove dirty page state when invalidating pages so releasepage does
the right thing when handed a dirty page
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZ5jqbAAoJEPh/dxk0SrTrtfMP/jcQ6lTDcpnQ7XEP2fg2dXjx
2+z8uI7Mjr5wo2qfIWHc8nZHZ+8KRak4U28rTlrXkeVbJ79x3Z+SzeipP76dGHXB
u9MD7uacTD6BDT7R8/bux7g7KrPATVJYJiT3PRHZ5ysUT6i9KnREdbaKpgOwhMcI
Ivd9ROZHx62CmZhsbfLzD+Ccy9/mGBR5OmT8nQlsuD8cEcFU5u1afaJ2/YlCjNLN
c16Q8dhGXed7tjduiYCzsxDiewJMzSfcGdyk6yCwXdR3zcI3RdhXUN5FRH0R9GB2
xxG1n5Q4qgtgODGgcPUl9WG8mfhVvEcuZGioxChQrxCEcaHt1Waop0fOixLy9J3Q
lUn4qjA5S+VBqa6XsKCSCkiZdDtncSedvMRQYef09q8DGAouwAtN/Z3BVM24oyWU
k5888Gt4EHZK6V3lz3qPMmGFxfuPL6GeyEvIYUezpVIYsmp0sLQTeNFUW+XC7fb/
tOBNom4ARHFmSb5da7uwJvesNZBVFSpFQtxkcx1OL0rhTqlKIfPP61dLznKhqUTL
2NhaFjnznYenSEK2CsP+V3CtQrCxywdqDNnOEgTgKJbWPpsYMX63z/Cmtm0A7Qdz
BAbGc+OSBLqelwsWNnNzTWPHk33SKxtIxGTe8gKbKbrzbR7mxyJxHKEwpZvWIqh+
8eTdgJb1wgJyqtBsTSHN
=UY00
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix some more CONFIG_XFS_RT related build problems
- fix data loss when writeback at eof races eofblocks gc and loses
- invalidate page cache after fs finishes a dio write
- remove dirty page state when invalidating pages so releasepage does
the right thing when handed a dirty page
* tag 'xfs-4.14-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move two more RT specific functions into CONFIG_XFS_RT
xfs: trim writepage mapping to within eof
fs: invalidate page cache after end_io() in dio completion
xfs: cancel dirty pages on invalidation
Pull block fixes from Jens Axboe:
"Three small fixes:
- A fix for skd, it was using kfree() to free a structure allocate
with kmem_cache_alloc().
- Stable fix for nbd, fixing a regression using the normal ioctl
based tools.
- Fix for a previous fix in this series, that fixed up
inconsistencies between buffered and direct IO"
* 'for-linus' of git://git.kernel.dk/linux-block:
fs: Avoid invalidation in interrupt context in dio_complete()
nbd: don't set the device size until we're connected
skd: Use kmem_cache_free
The channel value for requesting server remote invalidating local memory
registration should be 0x00000002
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Update reading the EA using increasingly larger buffer sizes
until the response will fit in the buffer, or we exceed the
(arbitrary) maximum set to 64kb.
Without this change, a user is able to add more and more EAs using
setfattr until the point where the total space of all EAs exceed 2kb
at which point the user can no longer list the EAs at all
and getfattr will abort with an error.
The same issue still exists for EAs in SMB1.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB3.1.1 is most secure and recent dialect. Fixup labels and lengths
for sMB3.1.1 signing and encryption.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Currently we try to defer completion of async DIO to the process context
in case there are any mapped pages associated with the inode so that we
can invalidate the pages when the IO completes. However the check is racy
and the pages can be mapped afterwards. If this happens we might end up
calling invalidate_inode_pages2_range() in dio_complete() in interrupt
context which could sleep. This can be reproduced by generic/451.
Fix this by passing the information whether we can or can't invalidate
to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
for suggesting a fix.
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mount i_version flag is not enabled in the new sb_flags. This patch
adds the missing SB_I_VERSION flag.
Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
superblock flags"
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last cleanup introduced two harmless warnings:
fs/xfs/xfs_fsmap.c:480:1: warning: '__xfs_getfsmap_rtdev' defined but not used
fs/xfs/xfs_fsmap.c:372:1: warning: 'xfs_getfsmap_rtdev_rtbitmap_helper' defined but not used
This moves those two functions as well.
Fixes: bb9c2e5433 ("xfs: move more RT specific code under CONFIG_XFS_RT")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The writeback rework in commit fbcc025613 ("xfs: Introduce
writeback context for writepages") introduced a subtle change in
behavior with regard to the block mapping used across the
->writepages() sequence. The previous xfs_cluster_write() code would
only flush pages up to EOF at the time of the writepage, thus
ensuring that any pages due to file-extending writes would be
handled on a separate cycle and with a new, updated block mapping.
The updated code establishes a block mapping in xfs_writepage_map()
that could extend beyond EOF if the file has post-eof preallocation.
Because we now use the generic writeback infrastructure and pass the
cached mapping to each writepage call, there is no implicit EOF
limit in place. If eofblocks trimming occurs during ->writepages(),
any post-eof portion of the cached mapping becomes invalid. The
eofblocks code has no means to serialize against writeback because
there are no pages associated with post-eof blocks. Therefore if an
eofblocks trim occurs and is followed by a file-extending buffered
write, not only has the mapping become invalid, but we could end up
writing a page to disk based on the invalid mapping.
Consider the following sequence of events:
- A buffered write creates a delalloc extent and post-eof
speculative preallocation.
- Writeback starts and on the first writepage cycle, the delalloc
extent is converted to real blocks (including the post-eof blocks)
and the mapping is cached.
- The file is closed and xfs_release() trims post-eof blocks. The
cached writeback mapping is now invalid.
- Another buffered write appends the file with a delalloc extent.
- The concurrent writeback cycle picks up the just written page
because the writeback range end is LLONG_MAX. xfs_writepage_map()
attributes it to the (now invalid) cached mapping and writes the
data to an incorrect location on disk (and where the file offset is
still backed by a delalloc extent).
This problem is reproduced by xfstests test generic/464, which
triggers racing writes, appends, open/closes and writeback requests.
To address this problem, trim the mapping used during writeback to
within EOF when the mapping is validated. This ensures the mapping
is revalidated for any pages encountered beyond EOF as of the time
the current mapping was cached or last validated.
Reported-by: Eryu Guan <eguan@redhat.com>
Diagnosed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 332391a993 ("fs: Fix page cache inconsistency when mixing
buffered and AIO DIO") moved page cache invalidation from
iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
path, but before the dio->end_io() call, and it re-introdued the bug
fixed by commit c771c14baa ("iomap: invalidate page caches should
be after iomap_dio_complete() in direct write").
I found this because fstests generic/418 started failing on XFS with
v4.14-rc3 kernel, which is the regression test for this specific
bug.
So similarly, fix it by moving dio->end_io() (which does the
unwritten extent conversion) before page cache invalidation, to make
sure next buffer read reads the final real allocations not unwritten
extents. I also add some comments about why should end_io() go first
in case we get it wrong again in the future.
Note that, there's no such problem in the non-iomap based direct
write path, because we didn't remove the page cache invalidation
after the ->direct_IO() in generic_file_direct_write() call, but I
decided to fix dio_complete() too so we don't leave a landmine
there, also be consistent with iomap_dio_complete().
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".
In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().
To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().
Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode. e.g. mount binfmt_misc filesystem then
umount it immediately:
mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
umount /proc/sys/fs/binfmt_misc
will result in
BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
...
Call Trace:
evict+0xd3/0x1a0
iput+0x17d/0x1d0
dentry_unlink_inode+0xb9/0xf0
__dentry_kill+0xc7/0x170
shrink_dentry_list+0x122/0x280
shrink_dcache_parent+0x39/0x90
do_one_tree+0x12/0x40
shrink_dcache_for_umount+0x2d/0x90
generic_shutdown_super+0x1f/0x120
kill_litter_super+0x29/0x40
deactivate_locked_super+0x43/0x70
deactivate_super+0x45/0x60
cleanup_mnt+0x3f/0x70
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xa0
exit_to_usermode_loop+0x6d/0x99
syscall_return_slowpath+0xba/0xf0
entry_SYSCALL_64_fastpath+0xa3/0xa
Fix it by making sure Node (e) is not NULL.
Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f918274e ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
call clean_buffers() after unlocking the page we've written. Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().
[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file totally
inaccessible.
- Fix a dangling pointer problem when removing an attr fork under memory
pressure.
- Don't crash while trying to invalidate a null buffer associated with a
corrupt metadata pointer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZ3lPiAAoJEPh/dxk0SrTrfuMP/Axy7VSX71tE/eXPOmzxCVZD
w4/usqO+OsQj+q8o+rwwuX9hz0VGF8kWZJOdgGdXpYT7pWqPmcf88wbThheTetLF
fjevusqva0Ds+U4AE7DCNWSKQQRhu2jDgnhQXTv1hdYhWIF59qGwioIijbEvb72I
0QW+/uV9yXmODjWL6KfRh9zRT9N4npMtszukScONwJr9t0/5ub8H03H/ktv8T9oi
C3ljEWwyMk5lEYH8p6tpta8EbY0mrIZgo+kj33PU5s9rHvcrTGtyPNqidREUm1fL
X3+STMytcDQFAcZdBBXHN0nFMwa8ADTrVvKmEgaR8OsXmOmrlcPn7HfVVlWrY31w
X3awJ0b0+IXUrsbbQOPeqgTo5hIkMDkMOga5AP/rqpx1yCCOrlMHaRPXB2NxNcVw
dyTj6IpKybhsQ4GkcqmFcgnxPPaogNpYlp6SXV5Dm+8zEJdIQNUuci/EGsNz7UcV
msxNlJJkxczXOew6JzCyw45wTnJCxduX7Y1xrOTLaDfa9pkWO2zQBXukCJNIqVIq
35Q4P4JVYtmwQr8XkkX9tiqU0gBWTCTG9KjmTCMm5MYkutEYM0uTNR5Jvyiobl7L
Nn+RydssVw7ssnNfgsLhzQHPElUivRdYoYFSBa2DQp6ViILrefqQegd5INAjK63W
7vnHVZyJMHPM0YFoiX8w
=6Yvh
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file
totally inaccessible.
- Fix a dangling pointer problem when removing an attr fork under
memory pressure.
- Don't crash while trying to invalidate a null buffer associated with
a corrupt metadata pointer.
* tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle error if xfs_btree_get_bufs fails
xfs: reinit btree pointer on attr tree inactivation walk
xfs: Fix bool initialization/comparison
xfs: don't change inode mode if ACL update fails
xfs: move more RT specific code under CONFIG_XFS_RT
xfs: Don't log uninitialised fields in inode structures
Pull quota fix from Jan Kara:
"A fix for a regression in handling of quota grace times and warnings"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
In eCryptfs, we failed to verify that the authentication token keys are
not revoked before dereferencing their payloads, which is problematic
because the payload of a revoked key is NULL. request_key() *does* skip
revoked keys, but there is still a window where the key can be revoked
before we acquire the key semaphore.
Fix it by updating ecryptfs_get_key_payload_data() to return
-EKEYREVOKED if the key payload is NULL. For completeness we check this
for "encrypted" keys as well as "user" keys, although encrypted keys
cannot be revoked currently.
Alternatively we could use key_validate(), but since we'll also need to
fix ecryptfs_get_key_payload_data() to validate the payload length, it
seems appropriate to just check the payload pointer.
Fixes: 237fead619 ("[PATCH] ecryptfs: fs/Makefile and fs/Kconfig")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.19+]
Cc: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When an fscrypt-encrypted file is opened, we request the file's master
key from the keyrings service as a logon key, then access its payload.
However, a revoked key has a NULL payload, and we failed to check for
this. request_key() *does* skip revoked keys, but there is still a
window where the key can be revoked before we acquire its semaphore.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 88bd6ccdcd ("ext4 crypto: add encryption key management facilities")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v4.1+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When the file /proc/fs/fscache/objects (available with
CONFIG_FSCACHE_OBJECT_LIST=y) is opened, we request a user key with
description "fscache:objlist", then access its payload. However, a
revoked key has a NULL payload, and we failed to check for this.
request_key() *does* skip revoked keys, but there is still a window
where the key can be revoked before we access its payload.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 4fbf4291aa ("FS-Cache: Allow the current state of all objects to be dumped")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.32+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:
XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000
_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.
Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.
The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.
Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.
To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.
Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.
This fixes xfstests generic/449.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>