Pull btrfs fixes from David Sterba:
"We've identified and fixed a silent corruption (introduced by code in
the first pull), a fixup after the blk_status_t merge and two fixes to
incremental send that Filipe has been hunting for some time"
* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unexpected return value of bio_readpage_error
btrfs: btrfs_create_repair_bio never fails, skip error handling
btrfs: cloned bios must not be iterated by bio_for_each_segment_all
Btrfs: fix write corruption due to bio cloning on raid5/6
Btrfs: incremental send, fix invalid memory access
Btrfs: incremental send, fix invalid path for link commands
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.
[1] https://patchwork.kernel.org/patch/9838535/
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
This function is supposed to return blk_status_t error codes now but
there was a stray -ENOMEM left behind.
Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no
point keeping it. The error handling can be simplified, as we know
btrfs_bio_alloc will never fail.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
Signed-off-by: David Sterba <dsterba@suse.com>
The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.
To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct compressed_bio pointer can be used instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of sending each argument of struct compressed_bio, send
the compressed_bio itself.
Also by having struct compressed_bio in btrfs_decompress_bio()
it would help tracing.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space. The
compression implementation will sink the parameter to a local variable
so everything works as before.
The nr_pages variables can also be simply merged in compress_file_range
into one.
Signed-off-by: David Sterba <dsterba@suse.com>
The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.
Signed-off-by: David Sterba <dsterba@suse.com>
The original csum error message only outputs inode number, offset, check
sum and expected check sum.
However no root objectid is outputted, which sometimes makes debugging
quite painful under multi-subvolume case (including relocation).
Also the checksum output is decimal, which seldom makes sense for
users/developers and is hard to read in most time.
This patch will add root objectid, which will be %lld for rootid larger
than LAST_FREE_OBJECTID, and hex csum output for better readability.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.
Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.
A case I saw when testing with zlib was:
buf_start = 42769
total_out = 46865
working_bytes = total_out - buf_start = 4096
start_byte = 45056
The condition (total_out > start_byte && buf_start < start_byte) is
true, so we adjust the offset:
buf_offset = start_byte - buf_start = 2287
working_bytes -= buf_offset = 1809
current_buf_start = buf_start = 42769
Then, we copy
bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
buf_offset += bytes = 4096
working_bytes -= bytes = 0
current_buf_start += bytes = 44578
After bio_advance(), we are still in the same page, so start_byte is the
same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
which is true! So, we adjust the values again:
buf_offset = start_byte - buf_start = 2287
working_bytes = total_out - start_byte = 1809
current_buf_start = buf_start + buf_offset = 45056
But note that working_bytes was already zero before this, so we should
have stopped copying.
Fixes: 974b1adc3b ("btrfs: use bio iterators for the decompression handlers")
Reported-by: Pat Erley <pat-lkml@erley.org>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the bvec offset and len members to prepare for multipage bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using bi_vcnt to calculate it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass the full bio to the decompression routines and use bio iterators
to iterate over the data in the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
csum member of struct btrfs_super_block has array type of u8. It makes
sense that function btrfs_csum_final should be also declared to accept
u8 *. I changed the declaration of method void btrfs_csum_final(u32 crc,
char *result); to void btrfs_csum_final(u32 crc, u8 *result);
Signed-off-by: Domagoj Tršan <domagoj.trsan@gmail.com>
[ changed cast to u8 at several call sites ]
Signed-off-by: David Sterba <dsterba@suse.com>
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch converts printk(KERN_* style messages to use the pr_* versions.
One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more btrfs updates from Chris Mason:
"This is part two of my btrfs pull, which is some cleanups and a batch
of fixes.
Most of the code here is from Jeff Mahoney, making the pointers we
pass around internally more consistent and less confusing overall. I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
Btrfs: fix __MAX_CSUM_ITEMS
btrfs: btrfs_abort_transaction, drop root parameter
btrfs: add btrfs_trans_handle->fs_info pointer
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
btrfs: convert nodesize macros to static inlines
btrfs: introduce BTRFS_MAX_ITEM_SIZE
btrfs: cleanup, remove prototype for btrfs_find_root_ref
btrfs: copy_to_sk drop unused root parameter
btrfs: simpilify btrfs_subvol_inherit_props
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
btrfs: tests, require fs_info for root
btrfs: tests, move initialization into tests/
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs: prefix fsid to all trace events
btrfs: plumb fs_info into btrfs_work
btrfs: remove obsolete part of comment in statfs
btrfs: hide test-only member under ifdef
btrfs: Ratelimit "no csum found" info message
btrfs: Add ratelimit to btrfs printing
Btrfs: fix unexpected balance crash due to BUG_ON
...
This is similar to btrfs_submit_compressed_read(), if we fail after
bio is allocated, then we can use bio_endio() and errors are saved
in bio->bi_error. But please note that we don't return errors to
its caller because the caller assumes it won't call endio to cleanup
on error.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
With just one preallocated workspace we can guarantee forward progress
even if there's no memory available for new workspaces. The cost is more
waiting but we also get rid of several error paths.
On average, there will be several idle workspaces, so the waiting
penalty won't be so bad.
In the worst case, all cpus will compete for one workspace until there's
some memory. Attempts to allocate a new one are done each time the
waiters are woken up.
Signed-off-by: David Sterba <dsterba@suse.com>
Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when need them, it might actually
deadlock, but in such situation the system has effectively not enough
memory to operate properly.
Signed-off-by: David Sterba <dsterba@suse.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Below variables are defined per compress type.
- struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
- spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
- int comp_num_workspace[BTRFS_COMPRESS_TYPES]
- atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
- wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]
BTW, while accessing one compress type of these variables, the next or
before address is other compress types of it.
So this patch puts these variables in a struct to make cache friendly.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.
Signed-off-by: David Sterba <dsterba@suse.cz>