We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
Problem:
(1). test_fd = open(fname, O_RDWR|O_DIRECT)
(2). pwrite(test_fd, buf, 4096, 0)
(3). close(test_fd)
(4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches"
(5). test_fd = open(fname, O_RDWR|O_DIRECT)
(6). fsync(test_fd);
<-------- we did not get the correct log entry for the file
Reason:
When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.
Fix:
This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs updates from Chris Mason:
"I've been running these through a longer set of load tests because my
commits change the free space cache writeout. It fixes commit stalls
on large filesystems (~20T space used and up) that we have been
triggering here. We were seeing new writers blocked for 10 seconds or
more during commits, which is far from good.
Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
more), that are triggered because our metadata reservations were not
properly accounting for crcs and were not replenishing during the
truncate.
Also in this series, a number of qgroup fixes from Fujitsu and Dave
Sterba collected most of the pending cleanups from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
btrfs: quota: Update quota tree after qgroup relationship change.
btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
btrfs: Update btrfs qgroup status item when rescan is done.
btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
btrfs: Check qgroup level in kernel qgroup assign.
btrfs: qgroup: allow to remove qgroup which has parent but no child.
btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
btrfs: qgroup: do a reservation in a higher level.
Btrfs: qgroup, Account data space in more proper timings.
Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
Btrfs: qgroup: free reserved in exceeding quota.
Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
btrfs: qgroup: fix limit args override whole limit struct
btrfs: qgroup: update limit info in function btrfs_run_qgroups().
btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
btrfs: Support busy loop of write and delete
...
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two problems in qgroup:
a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.
b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.
The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.
In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, for pre_alloc or delay_alloc, the bytes will be accounted
in space_info by the three guys.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
But on the other hand, in qgroup, there are only two counters to account the
bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved is not accounted
in qgroup. If so, there is a window we can exceed the quota limit when
bytes is in space_info->reserved.
Example:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 20987904 20987904 0 10485760 --- ---
qg->excl is 20987904 larger than max_excl 10485760.
This patch introduce a new counter named may_use to qgroup, then
there are three counters in qgroup to account bytes in space_info
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use --- qgroup->reserved --- qgroup->excl
With this patch applied:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 9453568 9453568 0 10485760 --- ---
Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
while true; do
dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
rm /btrfs_dir/file
sync
done
And we'll see dd failed because btrfs return NO_SPACE.
Reason:
Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
in end to free fs space for next write, but sometimes it hadn't
done work on time, because btrfs-cleaner thread get delayed-iputs
from list before, but do iput() after next write.
This is log:
[ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
[ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
[ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
[ 2569.087554] comm=sync func=btrfs_commit_transaction() end
[ 2569.191081] comm=dd begin
[ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
[ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
[ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
...
[ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
[ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
Fix:
Make btrfs_commit_transaction() wait current running btrfs-cleaner's
delayed-iputs() done in end.
Test:
Use script similar to above(more complex),
before patch:
7 failed in 100 * 20 loop.
after patch:
0 failed in 100 * 20 loop.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv. We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.
Signed-off-by: Chris Mason <clm@fb.com>
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.
For very large files, we can easily go through all of that space as we
loop through the extents. This changes truncate to refill the space
reservation as it progresses through the file.
Signed-off-by: Chris Mason <clm@fb.com>
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs. This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.
This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.
Signed-off-by: Chris Mason <clm@fb.com>
When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing. Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.
The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.
Signed-off-by: Chris Mason <clm@fb.com>
Building alpha:allmodconfig fails with
fs/btrfs/inode.c: In function 'check_direct_IO':
fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'
due to a missing include file.
Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested. This isn't quite right since we will
be limited to our max extent size. So instead lets do something horrible! Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent. If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly. Thanks,
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
I introduced a regression wrt outstanding_extents accounting. These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time. So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
My fix
Btrfs: fix merge delalloc logic
only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent. We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent. If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation. This shows up with a case like this
[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation. But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations. The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas. We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic. This fixes the regression that was reported by a user on
list. Thanks,
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Outside of misc fixes, Filipe has a few fsync corners and we're
pulling in one more of Josef's fixes from production use here"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Btrfs: fix data loss in the fast fsync path
Btrfs: remove extra run_delayed_refs in update_cowonly_root
Btrfs: incremental send, don't rename a directory too soon
btrfs: fix lost return value due to variable shadowing
Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
Btrfs: fix off-by-one logic error in btrfs_realloc_node
Btrfs: add missing inode update when punching hole
Btrfs: abort the transaction if we fail to update the free space cache inode
Btrfs: fix fsync race leading to ordered extent memory leaks
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.
Signed-off-by: David Sterba <dsterba@suse.cz>
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.
CC: <stable@vger.kernel.org> # 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"This pull is mostly cleanups and fixes:
- The raid5/6 cleanups from Zhao Lei fixup some long standing warts
in the code and add improvements on top of the scrubbing support
from 3.19.
- Josef has round one of our ENOSPC fixes coming from large btrfs
clusters here at FB.
- Dave Sterba continues a long series of cleanups (thanks Dave), and
Filipe continues hammering on corner cases in fsync and others
This all was held up a little trying to track down a use-after-free in
btrfs raid5/6. It's not clear yet if this is just made easier to
trigger with this pull or if its a new bug from the raid5/6 cleanups.
Dave Sterba is the only one to trigger it so far, but he has a
consistent way to reproduce, so we'll get it nailed shortly"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
Btrfs: don't remove extents and xattrs when logging new names
Btrfs: fix fsync data loss after adding hard link to inode
Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Btrfs: account for large extents with enospc
Btrfs: don't set and clear delalloc for O_DIRECT writes
Btrfs: only adjust outstanding_extents when we do a short write
btrfs: Fix out-of-space bug
Btrfs: scrub, fix sleep in atomic context
Btrfs: fix scheduler warning when syncing log
Btrfs: Remove unnecessary placeholder in btrfs_err_code
btrfs: cleanup init for list in free-space-cache
btrfs: delete chunk allocation attemp when setting block group ro
btrfs: clear bio reference after submit_one_bio()
Btrfs: fix scrub race leading to use-after-free
Btrfs: add missing cleanup on sysfs init failure
Btrfs: fix race between transaction commit and empty block group removal
btrfs: add more checks to btrfs_read_sys_array
btrfs: cleanup, rename a few variables in btrfs_read_sys_array
btrfs: add checks for sys_chunk_array sizes
btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
...
On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We do this to get the space accounting, but this is just needless churn on the
io_tree, so just drop setting/clearing delalloc and just drop the reserved data
space when we have a successfull allocation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have this weird dance where we always inc outstanding_extents when we do a
O_DIRECT write, even if we allocate the entire range. To get around this we
also drop the metadata space if we successfully write. This is an unnecessary
dance, we only need to jack up outstanding_extents if we don't satisfy the
entire range request in get_blocks_direct, otherwise we are good using our
original reservation. So drop the unconditional inc and the drop of the
metadata space that we have for the unconditional inc. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.
Signed-off-by: chandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
So we can check raid56 with:
(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
(map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.
The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.
struct extent_state {
u64 start; /* 0 8 */
u64 end; /* 8 8 */
struct rb_node rb_node; /* 16 24 */
wait_queue_head_t wq; /* 40 24 */
/* --- cacheline 1 boundary (64 bytes) --- */
atomic_t refs; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int state; /* 72 8 */
u64 private; /* 80 8 */
/* size: 88, cachelines: 2, members: 7 */
/* sum members: 84, holes: 1, sum holes: 4 */
/* last cacheline: 24 bytes */
};
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The errors are worth noting and might get missed with INFO level.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Several messages that point to some internal problem, level INFO is
wrong here.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.
After we've removed all such callers, passing a NULL key is not valid
anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
Pull btrfs fixes from Chris Mason:
"None of these are huge, but my commit does fix a regression from 3.18
that could cause lost files during log replay.
This also adds Dave Sterba to the list of Btrfs maintainers. It
doesn't mean we're doing things differently, but Dave has really been
helping with the maintainer workload for years"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't delay inode ref updates during log replay
Btrfs: correctly get tree level in tree_backref_for_extent
Btrfs: call inode_dec_link_count() on mkdir error path
Btrfs: abort transaction if we don't find the block group
Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
Btrfs: add more maintainers
In btrfs_mkdir(), if it fails to create dir, we should
clean up existed items, setting inode's link properly
to make sure it could be cleaned up properly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs update from Chris Mason:
"From a feature point of view, most of the code here comes from Miao
Xie and others at Fujitsu to implement scrubbing and replacing devices
on raid56. This has been in development for a while, and it's a big
improvement.
Filipe and Josef have a great assortment of fixes, many of which solve
problems corruptions either after a crash or in error conditions. I
still have a round two from Filipe for next week that solves
corruptions with discard and block group removal"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
Btrfs: make get_caching_control unconditionally return the ctl
Btrfs: fix unprotected deletion from pending_chunks list
Btrfs: fix fs mapping extent map leak
Btrfs: fix memory leak after block remove + trimming
Btrfs: make btrfs_abort_transaction consider existence of new block groups
Btrfs: fix race between writing free space cache and trimming
Btrfs: fix race between fs trimming and block group remove/allocation
Btrfs, replace: enable dev-replace for raid56
Btrfs: fix freeing used extents after removing empty block group
Btrfs: fix crash caused by block group removal
Btrfs: fix invalid block group rbtree access after bg is removed
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
Btrfs, replace: write raid56 parity into the replace target device
Btrfs, replace: write dirty pages into the replace target device
Btrfs, raid56: support parity scrub on raid56
Btrfs, raid56: use a variant to record the operation type
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
Btrfs, raid56: don't change bbio and raid_map
Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
...
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
For example, if we perform the following file operations:
$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ xfs_io -f \
-c "pwrite -S 0xaa -b 32K 0 32K" \
-c "fsync" \
-c "pwrite -S 0xbb -b 32770 16K 32770" \
-c "truncate 90123" \
/mnt/foobar
and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:
item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like the following:
"root 331 inode 257 errors 100, file extent discount"
>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:
1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.
This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).
This change applies on top of the previous patchset starting at the patch
titled:
"[1/5] Btrfs: set page and mapping error on compressed write failure"
Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sdb
# mount -t btrfs /dev/sdb /mnt -o compress=lzo
# dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.
Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)
Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This reverts commit 9c3b306e1c.
Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root. Plus, the root item for the snapshot's root wasn't
getting updated and preventing the next transaction commit to do it.
This made several users get into random corruption issues after creation
of readonly snapshots.
A regression test for xfstests will follow soon.
Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
is using wrong condition to judgement whether the range is a subset of a
existing extent map.
This may cause bug in btrfs no-holes mode.
This patch will correct the judgment and fix the bug.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are the branch hints that obviously depend on the data being
processed, the CPU predictor will do better job according to the actual
load. It also does not make sense to use the hints in slow paths that do
a lot of other operations like locking, waiting or IO.
Signed-off-by: David Sterba <dsterba@suse.cz>
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff. This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush. But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC. Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation. Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.
[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.
[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.
[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.
[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).
The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.
So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.
With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
the corrupted data is corrected, so the failure record can be removed. And if
some errors happen on the mirrors, we also needn't worry about it because the
failure record will be recreated if we read the same place again.
Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch implement data repair function when direct read fails.
The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.
But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs could still inline file data if its size is same as
page size, so don't skip max value here.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If flag NOCOMPRESS is set which means bad compression ratio,
we could avoid call cow_file_range_async() for this case earlier.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If a file's compression ratios is bad, we will set NOCOMPRESS
flag for it, and it will skip compression for that inode next time.
However, if we remount fs to COMPRESS_FORCE, it still should try
if we could compress pages for that inode, this patch fix wrong
check for this problem.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs defragment will utilize COW feature, which means this
did not work for nodatacow option, this problem was detected
by xfstests generic/018 with nodatacow mount option.
Fix this problem by forcing cow for a extent with state
@EXTETN_DEFRAG setting.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Filipe is doing a careful pass through fsync problems, and these are
the fixes so far. I'll have one more for rc6 that we're still
testing.
My big commit is fixing up some inode hash races that Al Viro found
(thanks Al)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use insert_inode_locked4 for inode creation
Btrfs: fix fsync data loss after a ranged fsync
Btrfs: kfree()ing ERR_PTRs
Btrfs: fix crash while doing a ranged fsync
Btrfs: fix corruption after write/fsync failure + fsync + log recovery
Btrfs: fix autodefrag with compression
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk. This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.
This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.
It also makes sure to init the operations pointers on the inode before
going into the error handling paths.
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, lets call it Y, is
smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.
Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.
Some sequence of steps that lead to this:
$ mkfs.btrfs -f /dev/sdd
$ mount -o commit=9999 /dev/sdd /mnt
$ cd /mnt
$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync
$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000
$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().
$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory
# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo
# Our file content (in page cache) is now:
$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.
# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.
$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:
$ btrfs-debug-tree /dev/sdd
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
extent data disk byte 12582912 nr 8192
extent data offset 0 nr 8192 ram 8192
item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 8192 ram 8192
item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
extent data disk byte 12582912 nr 4096
extent data offset 0 nr 4096 ram 4096
$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"The biggest of these comes from Liu Bo, who tracked down a hang we've
been hitting since moving to kernel workqueues (it's a btrfs bug, not
in the generic code). His patch needs backporting to 3.16 and 3.15
stable, which I'll send once this is in.
Otherwise these are assorted fixes. Most were integrated last week
during KS, but I wanted to give everyone the chance to test the
result, so I waited for rc2 to come out before sending"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix task hang under heavy compressed write
Btrfs: fix filemap_flush call in btrfs_file_release
Btrfs: fix crash on endio of reading corrupted block
btrfs: fix leak in qgroup_subtree_accounting() error path
btrfs: Use right extent length when inserting overlap extent map.
Btrfs: clone, don't create invalid hole extent map
Btrfs: don't monopolize a core when evicting inode
Btrfs: fix hole detection during file fsync
Btrfs: ensure tmpfile inode is always persisted with link count of 0
Btrfs: race free update of commit root for ro snapshots
Btrfs: fix regression of btrfs device replace
Btrfs: don't consider the missing device when allocating new chunks
Btrfs: Fix wrong device size when we are resizing the device
Btrfs: don't write any data into a readonly device when scrub
Btrfs: Fix the problem that the replace destroys the seed filesystem
btrfs: Return right extent when fiemap gives unaligned offset and len.
Btrfs: fix wrong extent mapping for DirectIO
Btrfs: fix wrong write range for filemap_fdatawrite_range()
Btrfs: fix wrong missing device counter decrease
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
...
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readhead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.
Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).
When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.
With this patch, I no long hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.
But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start: 27874 len 4100
28672 27874 28672 27874+4100 32768
|-----------------------|
|---------hole--------------------|---------data----------|
1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).
2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().
3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.
This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.
This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.
Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.
I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.
$ mkfs.btrfs -f -O no-holes $TEST_DEV
$ MKFS_OPTIONS="-O no-holes" ./check generic/299
generic/299 382s ...
Message from syslogd@debian-vm3 at Aug 7 10:44:29 ...
kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
384s
Ran: generic/299
Passed all 1 tests
$ dmesg
(...)
[86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
(...)
[86304.300036] Call Trace:
[86304.300036] [<ffffffff81698ba9>] __slab_free+0x54/0x295
[86304.300036] [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
[86304.300036] [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
[86304.300036] [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
[86304.300036] [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
[86304.300036] [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
[86304.300036] [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
[86304.300036] [<ffffffff811d8cbb>] evict+0xab/0x180
[86304.300036] [<ffffffff811d8dce>] dispose_list+0x3e/0x60
[86304.300036] [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
[86304.300036] [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
[86304.300036] [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
[86304.300036] [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
[86304.300036] [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
[86304.300036] [<ffffffff811be44e>] deactivate_super+0x4e/0x70
[86304.300036] [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
[86304.300036] [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
[86304.300036] [<ffffffff811e0517>] SyS_umount+0x97/0x100
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.
Steps to reproduce it (requires a modern xfs_io with -T support):
$ mkfs.btrfs -f /dev/sdd
$ mount -o /dev/sdd /mnt
$ xfs_io -T /mnt &
$ sync
Then btrfs-debug-tree shows the inode item with a link count of 1:
$ btrfs-debug-tree /dev/sdd
(...)
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 29556736 items 4 free space 15851 generation 6 owner 5
fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
inode ref index 0 namelen 2 name: ..
item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
orphan item
checksum tree key (CSUM_TREE ROOT_ITEM 0)
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is a better solution for the problem addressed in the following
commit:
Btrfs: update commit root on snapshot creation after orphan cleanup
(3821f34888)
The previous solution wasn't the best because of 2 reasons:
1) It added another full transaction commit, which is more expensive
than just swapping the commit root with the root;
2) If a reboot happened after the first transaction commit (the one
that creates the snapshot) and before the second transaction commit,
then we would end up with the same problem if a send using that
snapshot was requested before the first transaction commit after
the reboot.
This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.
Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.
This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.
However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.
Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"These are all fixes I'd like to get out to a broader audience.
The biggest of the bunch is Mark's quota fix, which is also in the
SUSE kernel, and makes our subvolume quotas dramatically more
accurate.
I've been running xfstests with these against your current git
overnight, but I'm queueing up longer tests as well"
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: disable strict file flushes for renames and truncates
Btrfs: fix csum tree corruption, duplicate and outdated checksums
Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
Btrfs: fix compressed write corruption on enospc
btrfs: correctly handle return from ulist_add
btrfs: qgroup: account shared subtrees during snapshot delete
Btrfs: read lock extent buffer while walking backrefs
Btrfs: __btrfs_mod_ref should always use no_quota
btrfs: adjust statfs calculations according to raid profiles
Truncates and renames are often used to replace old versions of a file
with new versions. Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.
Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.
This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held. It's rare, but these things can
deadlock.
This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.
Signed-off-by: Chris Mason <clm@fb.com>
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
RENAME_NOREPLACE is trivial to implement for most filesystems: switch over
to ->rename2() and check for the supported flags. The rest is done by the
VFS.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"This fixes some lockups in btrfs reported with rc1. It probably has
some performance impact because it is backing off our spinning locks
more often and switching to a blocking lock. I'll be able to nail
that down next week, but for now I want to get the lockups taken care
of.
Otherwise some more stack reduction and assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong error handle when the device is missing or is not writeable
Btrfs: fix deadlock when mounting a degraded fs
Btrfs: use bio_endio_nodec instead of open code
Btrfs: fix NULL pointer crash when running balance and scrub concurrently
btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
Btrfs: fix broken free space cache after the system crashed
Btrfs: make free space cache write out functions more readable
Btrfs: remove unused wait queue in struct extent_buffer
Btrfs: fix deadlocks with trylock on tree nodes
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.
There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
Pull btrfs updates from Chris Mason:
"The biggest change here is Josef's rework of the btrfs quota
accounting, which improves the in-memory tracking of delayed extent
operations.
I had been working on Btrfs stack usage for a while, mostly because it
had become impossible to do long stress runs with slab, lockdep and
pagealloc debugging turned on without blowing the stack. Even though
you upgraded us to a nice king sized stack, I kept most of the
patches.
We also have some very hard to find corruption fixes, an awesome sysfs
use after free, and the usual assortment of optimizations, cleanups
and other fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
Btrfs: convert smp_mb__{before,after}_clear_bit
Btrfs: fix scrub_print_warning to handle skinny metadata extents
Btrfs: make fsync work after cloning into a file
Btrfs: use right type to get real comparison
Btrfs: don't check nodes for extent items
Btrfs: don't release invalid page in btrfs_page_exists_in_range()
Btrfs: make sure we retry if page is a retriable exception
Btrfs: make sure we retry if we couldn't get the page
btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Btrfs: fix leaf corruption after __btrfs_drop_extents
Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
btrfs: free delayed node outside of root->inode_lock
btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
Btrfs: fix transaction leak during fsync call
btrfs: Avoid trucating page or punching hole in a already existed hole.
Btrfs: update commit root on snapshot creation after orphan cleanup
Btrfs: ioctl, don't re-lock extent range when not necessary
Btrfs: avoid visiting all extent items when cloning a range
...
When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if the page we got from
the radix tree is an exception entry, which can't be retried, we
exit the loop with a non-NULL page and then call page_cache_release
against it, which is not ok since it's not a valid page. This could
also make us return true when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if the page we get from the
radix tree is an exception which should make us retry, set page to
NULL in order to really retry, because otherwise we don't get another
loop iteration executed (page != NULL makes the while loop exit).
This also was making us call page_cache_release after exiting the loop,
which isn't correct because page doesn't point to a valid page, and
possibly return true from the function when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if we can't get the page
we need to retry. However we weren't retrying because we weren't
setting page to NULL, which makes the while loop exit immediately
and will make us call page_cache_release after exiting the loop
which is incorrect because our page get didn't succeed. This could
also make us return true when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.
This farms them off to async helper threads. The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.
Signed-off-by: Chris Mason <clm@fb.com>