/*
 * Copyright (C) 2007 Oracle. All rights reserved.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public
 * License v2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public
 * License along with this program; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 02111-1307, USA.
 */

#ifndef __BTRFS_CTREE__
#define __BTRFS_CTREE__

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/fs.h>
#include <linux/rwsem.h>
#include <linux/semaphore.h>
#include <linux/completion.h>
#include <linux/backing-dev.h>
#include <linux/wait.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
#include <linux/slab.h>
#include <linux/kobject.h>
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created, and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, and these tracepoints help in
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
information in btrfs, and these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
#include <trace/events/btrfs.h>
#include <asm/kmap_types.h>
#include <linux/pagemap.h>
#include <linux/btrfs.h>
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks which wanted to reserve
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
#include <linux/workqueue.h>
#include <linux/security.h>
#include "extent_io.h"
#include "extent_map.h"
#include "async-thread.h"

struct btrfs_trans_handle;
struct btrfs_transaction;
struct btrfs_pending_snapshot;
extern struct kmem_cache *btrfs_trans_handle_cachep;
extern struct kmem_cache *btrfs_transaction_cachep;
extern struct kmem_cache *btrfs_bit_radix_cachep;
extern struct kmem_cache *btrfs_path_cachep;
extern struct kmem_cache *btrfs_free_space_cachep;
struct btrfs_ordered_sum;

#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
#define STATIC noinline
#else
#define STATIC static noinline
#endif

#define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
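
The comment on BTRFS_MAGIC says the constant spells the ASCII string `_BHRfS_M`. The following is a minimal userspace sketch of why that holds, assuming the value is laid out in little-endian byte order (least significant byte first), as the `__le64` fields elsewhere in this header suggest. The helpers `magic_to_ascii` and `magic_matches` are hypothetical names used only for illustration; they are not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */

/*
 * Hypothetical helper: write the 8 magic bytes in little-endian order
 * (least significant byte first) into out[], then NUL-terminate so the
 * result can be compared as a C string. out must hold at least 9 bytes.
 */
static void magic_to_ascii(uint64_t magic, char out[9])
{
	for (int i = 0; i < 8; i++)
		out[i] = (char)((magic >> (8 * i)) & 0xff);
	out[8] = '\0';
}

/* Returns 1 when the little-endian bytes of magic spell "_BHRfS_M". */
static int magic_matches(uint64_t magic)
{
	char buf[9];

	magic_to_ascii(magic, buf);
	return strcmp(buf, "_BHRfS_M") == 0;
}
```

Reading the hex constant from its low byte upward gives 0x5F, 0x42, 0x48, 0x52, 0x66, 0x53, 0x5F, 0x4D, which are the characters `_`, `B`, `H`, `R`, `f`, `S`, `_`, `M`.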

#define BTRFS_MAX_MIRRORS 3

#define BTRFS_MAX_LEVEL 8

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
#define BTRFS_COMPAT_EXTENT_TREE_V0

/* holds pointers to all of the tree roots */
#define BTRFS_ROOT_TREE_OBJECTID 1ULL

/* stores information about which extents are in use, and reference counts */
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL

/*
 * chunk tree stores translations from logical -> physical block numbering
 * the super block points to the chunk tree
 */
#define BTRFS_CHUNK_TREE_OBJECTID 3ULL

/*
 * stores information about which areas of a given device are in use.
 * one per device. The tree of tree roots points to the device tree
 */
#define BTRFS_DEV_TREE_OBJECTID 4ULL

/* one per subvolume, storing files and directories */
#define BTRFS_FS_TREE_OBJECTID 5ULL

/* directory objectid inside the root tree */
#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL

Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
/* holds checksums of all the data extents */
#define BTRFS_CSUM_TREE_OBJECTID 7ULL

/* holds quota configuration and tracking */
#define BTRFS_QUOTA_TREE_OBJECTID 8ULL

Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
/* for storing items that use the BTRFS_UUID_KEY* types */
#define BTRFS_UUID_TREE_OBJECTID 9ULL

/* for storing balance parameters in the root tree */
#define BTRFS_BALANCE_OBJECTID -4ULL

/* orphan objectid for tracking unlinked/truncated files */
#define BTRFS_ORPHAN_OBJECTID -5ULL

/* does write ahead logging to speed up fsyncs */
#define BTRFS_TREE_LOG_OBJECTID -6ULL
#define BTRFS_TREE_LOG_FIXUP_OBJECTID -7ULL

/* for space balancing */
#define BTRFS_TREE_RELOC_OBJECTID -8ULL
#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL

/*
 * extent checksums all have this objectid
 * this allows them to share the logging tree
 * for fsyncs
 */
#define BTRFS_EXTENT_CSUM_OBJECTID -10ULL

/* For storing free space cache */
#define BTRFS_FREE_SPACE_OBJECTID -11ULL

/*
 * The inode number assigned to the special inode for storing
 * free ino cache
 */
#define BTRFS_FREE_INO_OBJECTID -12ULL

/* dummy objectid represents multiple objectids */
#define BTRFS_MULTIPLE_OBJECTIDS -255ULL

/*
 * All files have objectids in this range.
 */
#define BTRFS_FIRST_FREE_OBJECTID 256ULL
#define BTRFS_LAST_FREE_OBJECTID -256ULL
#define BTRFS_FIRST_CHUNK_TREE_OBJECTID 256ULL
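
The negative objectids above are unsigned 64-bit constants, so `-256ULL` wraps to 2^64 - 256 and the reserved ids (`-4ULL` through `-255ULL`) sit just above the file range. The sketch below illustrates the range check implied by the "All files have objectids in this range" comment. The helper name `objectid_in_file_range` is hypothetical, and the block is a userspace illustration rather than kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_ORPHAN_OBJECTID -5ULL
#define BTRFS_FIRST_FREE_OBJECTID 256ULL
#define BTRFS_LAST_FREE_OBJECTID -256ULL

/*
 * Hypothetical helper: returns 1 when objectid lies in the inclusive
 * range [BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID].
 * The comparison is unsigned, so -256ULL acts as 2^64 - 256.
 */
static int objectid_in_file_range(uint64_t objectid)
{
	return objectid >= BTRFS_FIRST_FREE_OBJECTID &&
	       objectid <= BTRFS_LAST_FREE_OBJECTID;
}
```

With these definitions, 256 is the first objectid accepted, while a reserved id such as BTRFS_ORPHAN_OBJECTID compares greater than BTRFS_LAST_FREE_OBJECTID and is rejected.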

/*
 * the device items go into the chunk tree. The key is in the form
 * [ 1 BTRFS_DEV_ITEM_KEY device_id ]
 */
#define BTRFS_DEV_ITEMS_OBJECTID 1ULL

#define BTRFS_BTREE_INODE_OBJECTID 1

#define BTRFS_EMPTY_SUBVOL_DIR_OBJECTID 2

#define BTRFS_DEV_REPLACE_DEVID 0ULL

/*
 * the max metadata block size. This limit is somewhat artificial,
 * but the memmove costs go through the roof for larger blocks.
 */
#define BTRFS_MAX_METADATA_BLOCKSIZE 65536

/*
 * we can actually store much bigger names, but let's not confuse the rest
 * of linux
 */
#define BTRFS_NAME_LEN 255

/*
 * Theoretical limit is larger, but we keep this down to a sane
 * value. That should limit greatly the possibility of collisions on
 * inode ref items.
 */
#define BTRFS_LINK_MAX 65535U

/* 32 bytes in various csum fields */
#define BTRFS_CSUM_SIZE 32

/* csum types */
#define BTRFS_CSUM_TYPE_CRC32 0

static int btrfs_csum_sizes[] = { 4 };
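
The `btrfs_csum_sizes` table is indexed by checksum type, so the CRC32 entry (type 0) occupies 4 of the 32 bytes reserved by BTRFS_CSUM_SIZE. The following is a hedged userspace sketch of a bounds-checked lookup. `csum_size_for_type` is a hypothetical helper name and is not claimed to match any accessor the kernel provides.

```c
#include <assert.h>
#include <stddef.h>

#define BTRFS_CSUM_SIZE 32
#define BTRFS_CSUM_TYPE_CRC32 0

static int btrfs_csum_sizes[] = { 4 };

/*
 * Hypothetical helper: return the size in bytes of a checksum of the
 * given type, or -1 when the type has no entry in the table.
 */
static int csum_size_for_type(unsigned int csum_type)
{
	size_t ntypes = sizeof(btrfs_csum_sizes) / sizeof(btrfs_csum_sizes[0]);

	if (csum_type >= ntypes)
		return -1;
	/* every table entry must fit in the 32-byte csum fields */
	assert(btrfs_csum_sizes[csum_type] <= BTRFS_CSUM_SIZE);
	return btrfs_csum_sizes[csum_type];
}
```

Returning -1 for an unknown type is a design choice of this sketch; it keeps an out-of-range type read from disk from indexing past the end of the table.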

/* four bytes for CRC32 */
#define BTRFS_EMPTY_DIR_SIZE 0

/* specific to btrfs_map_block(), therefore not in include/linux/blk_types.h */
#define REQ_GET_READ_MIRRORS (1 << 30)

#define BTRFS_FT_UNKNOWN 0
#define BTRFS_FT_REG_FILE 1
#define BTRFS_FT_DIR 2
#define BTRFS_FT_CHRDEV 3
#define BTRFS_FT_BLKDEV 4
#define BTRFS_FT_FIFO 5
#define BTRFS_FT_SOCK 6
#define BTRFS_FT_SYMLINK 7
#define BTRFS_FT_XATTR 8
#define BTRFS_FT_MAX 9

/* ioprio of readahead is set to idle */
#define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))

#define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024)

#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024)
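
BTRFS_MAX_EXTENT_SIZE carries no comment in this header. Assuming it caps the size of a single extent at 128 MiB, as the name suggests, a byte range longer than that has to be covered by several extents. The sketch below shows the ceiling division involved. `extents_needed` is a hypothetical helper, not a kernel function, and it ignores overflow for lengths within 128 MiB of 2^64.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024)

/*
 * Hypothetical helper: number of extents of at most
 * BTRFS_MAX_EXTENT_SIZE bytes needed to cover len bytes,
 * computed as a ceiling division.
 */
static uint64_t extents_needed(uint64_t len)
{
	return (len + BTRFS_MAX_EXTENT_SIZE - 1) / BTRFS_MAX_EXTENT_SIZE;
}
```

Under that assumption, a 1 GiB range needs 8 extents, and a range one byte larger than 128 MiB needs 2.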

/*
 * The key defines the order in the tree, and so it also defines (optimal)
 * block layout.
 *
 * objectid corresponds to the inode number.
 *
 * type tells us things about the object, and is a kind of stream selector.
 * so for a given inode, keys with type of 1 might refer to the inode data,
 * type of 2 may point to file data in the btree and type == 3 may point to
 * extents.
 *
 * offset is the starting byte offset for this key in the stream.
 *
 * btrfs_disk_key is in disk byte order. struct btrfs_key is always
 * in cpu native order. Otherwise they are identical and their sizes
 * should be the same (ie both packed)
 */
struct btrfs_disk_key {
	__le64 objectid;
	u8 type;
	__le64 offset;
} __attribute__ ((__packed__));

struct btrfs_key {
	u64 objectid;
	u8 type;
	u64 offset;
} __attribute__ ((__packed__));
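
The comment above says the key defines the order in the tree. The following is a minimal userspace sketch of such an ordering, assuming keys sort by objectid first, then type, then offset (the field order of the struct). `struct key` and `key_compare` are illustrative stand-ins with hypothetical names, using `<stdint.h>` types in place of the kernel's `u64` and `u8`, and relying on the GCC/Clang `packed` attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct btrfs_key (cpu byte order). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__ ((__packed__));

/*
 * Compare two keys field by field: objectid first, then type, then
 * offset. Returns -1, 0 or 1 depending on whether a sorts before,
 * equal to, or after b.
 */
static int key_compare(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the struct is packed, it occupies 8 + 1 + 8 = 17 bytes with no padding, which is what lets the disk and cpu variants share a size. Sorting this way keeps all items of one objectid adjacent, grouped by type and then ordered by offset.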

struct btrfs_mapping_tree {
	struct extent_map_tree map_tree;
};

struct btrfs_dev_item {
	/* the internal btrfs device id */
	__le64 devid;

	/* size of the device */
	__le64 total_bytes;

	/* bytes used */
	__le64 bytes_used;

	/* optimal io alignment for this device */
	__le32 io_align;

	/* optimal io width for this device */
	__le32 io_width;

	/* minimal io size for this device */
	__le32 sector_size;

	/* type and info about this device */
	__le64 type;

	/* expected generation for this device */
	__le64 generation;

	/*
	 * starting byte of this partition on the device,
	 * to allow for stripe alignment in the future
	 */
	__le64 start_offset;

	/* grouping information for allocation decisions */
	__le32 dev_group;

	/* seek speed 0-100 where 100 is fastest */
	u8 seek_speed;

	/* bandwidth 0-100 where 100 is fastest */
	u8 bandwidth;

	/* btrfs generated uuid for this device */
	u8 uuid[BTRFS_UUID_SIZE];

	/* uuid of FS who owns this device */
	u8 fsid[BTRFS_UUID_SIZE];
} __attribute__ ((__packed__));

struct btrfs_stripe {
	__le64 devid;
	__le64 offset;
	u8 dev_uuid[BTRFS_UUID_SIZE];
} __attribute__ ((__packed__));

struct btrfs_chunk {
	/* size of this chunk in bytes */
	__le64 length;

	/* objectid of the root referencing this chunk */
	__le64 owner;

	__le64 stripe_len;
	__le64 type;

	/* optimal io alignment for this chunk */
	__le32 io_align;

	/* optimal io width for this chunk */
	__le32 io_width;

	/* minimal io size for this chunk */
	__le32 sector_size;

	/* 2^16 stripes is quite a lot, a second limit is the size of a single
	 * item in the btree
	 */
	__le16 num_stripes;

	/* sub stripes only matter for raid10 */
	__le16 sub_stripes;
	struct btrfs_stripe stripe;
	/* additional stripes go here */
} __attribute__ ((__packed__));
|
|
|
|
|
2010-06-22 01:48:16 +07:00
|
|
|
#define BTRFS_FREE_SPACE_EXTENT 1
|
|
|
|
#define BTRFS_FREE_SPACE_BITMAP 2
|
|
|
|
|
|
|
|
struct btrfs_free_space_entry {
|
|
|
|
__le64 offset;
|
|
|
|
__le64 bytes;
|
|
|
|
u8 type;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_free_space_header {
|
|
|
|
struct btrfs_disk_key location;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 num_entries;
|
|
|
|
__le64 num_bitmaps;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
static inline unsigned long btrfs_chunk_item_size(int num_stripes)
|
|
|
|
{
|
|
|
|
BUG_ON(num_stripes == 0);
|
|
|
|
return sizeof(struct btrfs_chunk) +
|
|
|
|
sizeof(struct btrfs_stripe) * (num_stripes - 1);
|
|
|
|
}
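The size calculation above can be sketched in userspace. This is a minimal illustration, not kernel code: the `demo_` names are hypothetical, `uint64_t` and friends stand in for `__le64`/`u8`, `assert` stands in for `BUG_ON`, and `BTRFS_UUID_SIZE` is assumed to be 16. Because one `btrfs_stripe` is already embedded in the chunk, only `num_stripes - 1` extra stripes are added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_UUID_SIZE 16 /* assumed value of BTRFS_UUID_SIZE */

/* Userspace mirror of struct btrfs_stripe (field order as in the header). */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[DEMO_UUID_SIZE];
} __attribute__((__packed__));

/* Userspace mirror of struct btrfs_chunk; the first stripe is embedded. */
struct demo_chunk {
	uint64_t length;
	uint64_t owner;
	uint64_t stripe_len;
	uint64_t type;
	uint32_t io_align;
	uint32_t io_width;
	uint32_t sector_size;
	uint16_t num_stripes;
	uint16_t sub_stripes;
	struct demo_stripe stripe;
	/* additional stripes follow on disk */
} __attribute__((__packed__));

/* Same arithmetic as btrfs_chunk_item_size(), with assert for BUG_ON. */
static size_t demo_chunk_item_size(int num_stripes)
{
	assert(num_stripes > 0);
	return sizeof(struct demo_chunk) +
	       sizeof(struct demo_stripe) * (size_t)(num_stripes - 1);
}
```

Under these assumptions a single-stripe chunk item occupies `sizeof(struct demo_chunk)` bytes and every further stripe adds `sizeof(struct demo_stripe)` bytes.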
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
|
|
|
|
#define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
|
2011-01-06 18:30:25 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* File system states
|
|
|
|
*/
|
2013-01-29 17:14:48 +07:00
|
|
|
#define BTRFS_FS_STATE_ERROR 0
|
2013-02-21 13:32:52 +07:00
|
|
|
#define BTRFS_FS_STATE_REMOUNTING 1
|
2013-03-12 21:46:08 +07:00
|
|
|
#define BTRFS_FS_STATE_TRANS_ABORTED 2
|
Btrfs: fix use-after-free in the finishing procedure of the device replace
During the device replace test, we hit a null pointer dereference (it was very easy
to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree, and we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that includes the source device
before the mapping information is updated, and then submit a bio
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so this method prevents
new chunk allocation, and after we remove the source device, all
new chunks will be allocated on the new device. So it fixes
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter; we not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in this patch, and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 15:46:55 +07:00
|
|
|
#define BTRFS_FS_STATE_DEV_REPLACING 3
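Note that the `BTRFS_FS_STATE_*` values are bit indices (0, 1, 2, 3), not masks like the `(1ULL << n)` super block flags that follow. A minimal userspace sketch of how such indices map onto a state word is shown below; the `demo_` helpers are hypothetical stand-ins for bit operations on an `unsigned long`, and they are not atomic.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_FS_STATE_ERROR		0
#define DEMO_FS_STATE_REMOUNTING	1
#define DEMO_FS_STATE_TRANS_ABORTED	2
#define DEMO_FS_STATE_DEV_REPLACING	3

/* Number of usable bits in one state word. */
#define DEMO_STATE_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Set state bit number 'bit' in *state (non-atomic illustration). */
static void demo_set_state(unsigned long *state, int bit)
{
	assert(bit >= 0 && bit < DEMO_STATE_BITS);
	*state |= 1UL << bit;
}

/* Clear state bit number 'bit' in *state. */
static void demo_clear_state(unsigned long *state, int bit)
{
	assert(bit >= 0 && bit < DEMO_STATE_BITS);
	*state &= ~(1UL << bit);
}

/* Return 1 if state bit number 'bit' is set, 0 otherwise. */
static int demo_test_state(const unsigned long *state, int bit)
{
	assert(bit >= 0 && bit < DEMO_STATE_BITS);
	return (int)((*state >> bit) & 1UL);
}
```

Keeping the definitions as indices rather than masks means the same constant can be passed directly to bit-number based helpers.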
|
2011-01-06 18:30:25 +07:00
|
|
|
|
2013-01-29 17:14:48 +07:00
|
|
|
/* Super block flags */
|
2011-01-06 18:30:25 +07:00
|
|
|
/* Errors detected */
|
|
|
|
#define BTRFS_SUPER_FLAG_ERROR (1ULL << 2)
|
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_SUPER_FLAG_SEEDING (1ULL << 32)
|
|
|
|
#define BTRFS_SUPER_FLAG_METADUMP (1ULL << 33)
|
|
|
|
|
|
|
|
#define BTRFS_BACKREF_REV_MAX 256
|
|
|
|
#define BTRFS_BACKREF_REV_SHIFT 56
|
|
|
|
#define BTRFS_BACKREF_REV_MASK (((u64)BTRFS_BACKREF_REV_MAX - 1) << \
|
|
|
|
BTRFS_BACKREF_REV_SHIFT)
|
|
|
|
|
|
|
|
#define BTRFS_OLD_BACKREF_REV 0
|
|
|
|
#define BTRFS_MIXED_BACKREF_REV 1
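The revision macros above reserve the top 8 bits of a 64-bit flags word for the back reference revision. The sketch below shows how a revision could be read and written with that mask and shift; the `demo_` helper names are hypothetical and the code is a userspace illustration under that assumption, not the kernel's accessors.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BACKREF_REV_MAX	256
#define DEMO_BACKREF_REV_SHIFT	56
#define DEMO_BACKREF_REV_MASK	(((uint64_t)DEMO_BACKREF_REV_MAX - 1) << \
				 DEMO_BACKREF_REV_SHIFT)

#define DEMO_OLD_BACKREF_REV	0
#define DEMO_MIXED_BACKREF_REV	1

/* Extract the revision stored in the top byte of 'flags'. */
static int demo_backref_rev(uint64_t flags)
{
	return (int)((flags & DEMO_BACKREF_REV_MASK) >> DEMO_BACKREF_REV_SHIFT);
}

/* Return 'flags' with its revision replaced by 'rev'; low bits are kept. */
static uint64_t demo_set_backref_rev(uint64_t flags, int rev)
{
	assert(rev >= 0 && rev < DEMO_BACKREF_REV_MAX);
	flags &= ~DEMO_BACKREF_REV_MASK;
	flags |= (uint64_t)rev << DEMO_BACKREF_REV_SHIFT;
	return flags;
}
```

Since the mask covers only bits 56 to 63, ordinary flag bits such as `(1ULL << 0)` and `(1ULL << 1)` survive a revision update unchanged.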
|
2008-04-01 22:21:32 +07:00
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
|
|
|
* every tree block (leaf or node) starts with this header.
|
|
|
|
*/
|
2007-03-12 23:29:44 +07:00
|
|
|
struct btrfs_header {
|
2008-04-16 02:41:47 +07:00
|
|
|
/* these first four must match the super block */
|
2007-03-30 02:15:27 +07:00
|
|
|
u8 csum[BTRFS_CSUM_SIZE];
|
2007-10-16 03:14:19 +07:00
|
|
|
u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 bytenr; /* which block this node is supposed to live in */
|
2008-04-01 22:21:32 +07:00
|
|
|
__le64 flags;
|
2008-04-16 02:41:47 +07:00
|
|
|
|
|
|
|
/* allowed to be different from the super from here on down */
|
|
|
|
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
|
2007-03-24 02:56:19 +07:00
|
|
|
__le64 generation;
|
2007-04-21 07:23:12 +07:00
|
|
|
__le64 owner;
|
2007-10-16 03:14:19 +07:00
|
|
|
__le32 nritems;
|
2007-03-27 20:06:38 +07:00
|
|
|
u8 level;
|
2007-02-02 21:18:22 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
#define BTRFS_NODEPTRS_PER_BLOCK(r) (((r)->nodesize - \
|
2009-01-06 09:25:51 +07:00
|
|
|
sizeof(struct btrfs_header)) / \
|
|
|
|
sizeof(struct btrfs_key_ptr))
|
2007-03-15 01:14:43 +07:00
|
|
|
#define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header))
|
2014-06-05 00:22:26 +07:00
|
|
|
#define BTRFS_LEAF_DATA_SIZE(r) (__BTRFS_LEAF_DATA_SIZE(r->nodesize))
|
2014-07-24 22:34:58 +07:00
|
|
|
#define BTRFS_FILE_EXTENT_INLINE_DATA_START \
|
|
|
|
(offsetof(struct btrfs_file_extent_item, disk_bytenr))
|
2007-04-20 00:37:44 +07:00
|
|
|
#define BTRFS_MAX_INLINE_DATA_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r) - \
|
|
|
|
sizeof(struct btrfs_item) - \
|
2014-07-24 22:34:58 +07:00
|
|
|
BTRFS_FILE_EXTENT_INLINE_DATA_START)
|
2009-11-12 16:35:27 +07:00
|
|
|
#define BTRFS_MAX_XATTR_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r) - \
|
|
|
|
sizeof(struct btrfs_item) -\
|
|
|
|
sizeof(struct btrfs_dir_item))
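To make the block-capacity macros above concrete, the following userspace sketch mirrors the packed structures and evaluates the same arithmetic. Several values are assumptions not shown in this part of the header: `BTRFS_CSUM_SIZE` is taken as 32, `BTRFS_FSID_SIZE` and `BTRFS_UUID_SIZE` as 16, and `struct btrfs_disk_key` as a packed `{u64 objectid; u8 type; u64 offset;}`. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CSUM_SIZE 32 /* assumed BTRFS_CSUM_SIZE */
#define DEMO_FSID_SIZE 16 /* assumed BTRFS_FSID_SIZE */
#define DEMO_UUID_SIZE 16 /* assumed BTRFS_UUID_SIZE */

/* Assumed layout of struct btrfs_disk_key. */
struct demo_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((__packed__));

/* Mirror of struct btrfs_header. */
struct demo_header {
	uint8_t csum[DEMO_CSUM_SIZE];
	uint8_t fsid[DEMO_FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint8_t chunk_tree_uuid[DEMO_UUID_SIZE];
	uint64_t generation;
	uint64_t owner;
	uint32_t nritems;
	uint8_t level;
} __attribute__((__packed__));

/* Mirror of struct btrfs_key_ptr. */
struct demo_key_ptr {
	struct demo_disk_key key;
	uint64_t blockptr;
	uint64_t generation;
} __attribute__((__packed__));

/* Same arithmetic as BTRFS_NODEPTRS_PER_BLOCK for a given nodesize. */
static size_t demo_nodeptrs_per_block(uint32_t nodesize)
{
	assert(nodesize > sizeof(struct demo_header));
	return (nodesize - sizeof(struct demo_header)) /
	       sizeof(struct demo_key_ptr);
}

/* Same arithmetic as __BTRFS_LEAF_DATA_SIZE for a given block size. */
static size_t demo_leaf_data_size(uint32_t blocksize)
{
	assert(blocksize > sizeof(struct demo_header));
	return blocksize - sizeof(struct demo_header);
}
```

With these assumed sizes the header takes 101 bytes and each key pointer 33 bytes, so the number of pointers per node scales almost linearly with `nodesize`.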
|
2007-02-02 21:18:22 +07:00
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is a very generous portion of the super block, giving us
|
|
|
|
* room to translate 14 chunks with 3 stripes each.
|
|
|
|
*/
|
|
|
|
#define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
|
2008-04-18 21:29:49 +07:00
|
|
|
#define BTRFS_LABEL_SIZE 256
|
2008-03-25 02:01:56 +07:00
|
|
|
|
2011-11-04 02:17:42 +07:00
|
|
|
/*
|
|
|
|
* just in case we somehow lose the roots and are not able to mount,
|
|
|
|
* we store an array of the roots from previous transactions
|
|
|
|
* in the super.
|
|
|
|
*/
|
|
|
|
#define BTRFS_NUM_BACKUP_ROOTS 4
|
|
|
|
struct btrfs_root_backup {
|
|
|
|
__le64 tree_root;
|
|
|
|
__le64 tree_root_gen;
|
|
|
|
|
|
|
|
__le64 chunk_root;
|
|
|
|
__le64 chunk_root_gen;
|
|
|
|
|
|
|
|
__le64 extent_root;
|
|
|
|
__le64 extent_root_gen;
|
|
|
|
|
|
|
|
__le64 fs_root;
|
|
|
|
__le64 fs_root_gen;
|
|
|
|
|
|
|
|
__le64 dev_root;
|
|
|
|
__le64 dev_root_gen;
|
|
|
|
|
|
|
|
__le64 csum_root;
|
|
|
|
__le64 csum_root_gen;
|
|
|
|
|
|
|
|
__le64 total_bytes;
|
|
|
|
__le64 bytes_used;
|
|
|
|
__le64 num_devices;
|
|
|
|
/* future */
|
2012-10-31 22:16:32 +07:00
|
|
|
__le64 unused_64[4];
|
2011-11-04 02:17:42 +07:00
|
|
|
|
|
|
|
u8 tree_root_level;
|
|
|
|
u8 chunk_root_level;
|
|
|
|
u8 extent_root_level;
|
|
|
|
u8 fs_root_level;
|
|
|
|
u8 dev_root_level;
|
|
|
|
u8 csum_root_level;
|
|
|
|
/* future and to align */
|
|
|
|
u8 unused_8[10];
|
|
|
|
} __attribute__ ((__packed__));
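The "future and to align" padding in `struct btrfs_root_backup` can be checked with a compile-time size assertion. The sketch below is a userspace mirror with hypothetical `demo_` names and fixed-width integers in place of `__le64`/`u8`; it only illustrates that the layout shown above adds up to a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct btrfs_root_backup. */
struct demo_root_backup {
	uint64_t tree_root;
	uint64_t tree_root_gen;
	uint64_t chunk_root;
	uint64_t chunk_root_gen;
	uint64_t extent_root;
	uint64_t extent_root_gen;
	uint64_t fs_root;
	uint64_t fs_root_gen;
	uint64_t dev_root;
	uint64_t dev_root_gen;
	uint64_t csum_root;
	uint64_t csum_root_gen;
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t num_devices;
	uint64_t unused_64[4];	/* future */
	uint8_t tree_root_level;
	uint8_t chunk_root_level;
	uint8_t extent_root_level;
	uint8_t fs_root_level;
	uint8_t dev_root_level;
	uint8_t csum_root_level;
	uint8_t unused_8[10];	/* future and to align */
} __attribute__((__packed__));

/* 19 * 8 bytes of u64 fields plus 16 bytes of u8 fields. */
static_assert(sizeof(struct demo_root_backup) == 168,
	      "unexpected root backup size");
static_assert(sizeof(struct demo_root_backup) % 8 == 0,
	      "root backup is not 8-byte aligned in size");

/* Bytes occupied by an array of 'n' backup slots. */
static size_t demo_backup_array_bytes(size_t n)
{
	return n * sizeof(struct demo_root_backup);
}
```

An array of `BTRFS_NUM_BACKUP_ROOTS` (4) such slots therefore occupies `4 * sizeof(struct demo_root_backup)` bytes in this model.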
|
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
|
|
|
* the super block basically lists the main trees of the FS
|
|
|
|
* it currently lacks any block count etc etc
|
|
|
|
*/
|
2007-03-13 21:46:10 +07:00
|
|
|
struct btrfs_super_block {
|
2007-03-30 02:15:27 +07:00
|
|
|
u8 csum[BTRFS_CSUM_SIZE];
|
2008-04-01 22:21:32 +07:00
|
|
|
/* the first 4 fields must match struct btrfs_header */
|
2008-11-18 09:11:30 +07:00
|
|
|
u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 bytenr; /* this block number */
|
2008-04-01 22:21:32 +07:00
|
|
|
__le64 flags;
|
2008-04-16 02:41:47 +07:00
|
|
|
|
|
|
|
/* allowed to be different from the btrfs_header from here on down */
|
2007-03-14 03:47:54 +07:00
|
|
|
__le64 magic;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 root;
|
2008-03-25 02:01:56 +07:00
|
|
|
__le64 chunk_root;
|
2008-09-06 03:13:11 +07:00
|
|
|
__le64 log_root;
|
2008-12-09 04:40:21 +07:00
|
|
|
|
|
|
|
/* this will help find the new super based on the log root */
|
|
|
|
__le64 log_root_transid;
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 total_bytes;
|
|
|
|
__le64 bytes_used;
|
2007-03-21 22:12:56 +07:00
|
|
|
__le64 root_dir_objectid;
|
2008-03-25 02:02:07 +07:00
|
|
|
__le64 num_devices;
|
2007-10-16 03:14:19 +07:00
|
|
|
__le32 sectorsize;
|
|
|
|
__le32 nodesize;
|
2014-06-05 00:22:26 +07:00
|
|
|
__le32 __unused_leafsize;
|
2007-11-30 23:30:34 +07:00
|
|
|
__le32 stripesize;
|
2008-03-25 02:01:56 +07:00
|
|
|
__le32 sys_chunk_array_size;
|
2008-10-30 01:49:05 +07:00
|
|
|
__le64 chunk_root_generation;
|
2008-12-02 18:36:08 +07:00
|
|
|
__le64 compat_flags;
|
|
|
|
__le64 compat_ro_flags;
|
|
|
|
__le64 incompat_flags;
|
2008-12-02 19:17:45 +07:00
|
|
|
__le16 csum_type;
|
2007-10-16 03:15:53 +07:00
|
|
|
u8 root_level;
|
2008-03-25 02:01:56 +07:00
|
|
|
u8 chunk_root_level;
|
2008-09-06 03:13:11 +07:00
|
|
|
u8 log_root_level;
|
2008-03-25 02:02:07 +07:00
|
|
|
struct btrfs_dev_item dev_item;
|
2008-12-09 04:40:21 +07:00
|
|
|
|
2008-04-18 21:29:49 +07:00
|
|
|
char label[BTRFS_LABEL_SIZE];
|
2008-12-09 04:40:21 +07:00
|
|
|
|
2010-06-22 01:48:16 +07:00
|
|
|
__le64 cache_generation;
|
2013-08-15 22:11:22 +07:00
|
|
|
__le64 uuid_tree_generation;
|
2010-06-22 01:48:16 +07:00
|
|
|
|
2008-12-09 04:40:21 +07:00
|
|
|
/* future expansion */
|
2013-08-15 22:11:22 +07:00
|
|
|
__le64 reserved[30];
|
2008-03-25 02:01:56 +07:00
|
|
|
u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
|
2011-11-04 02:17:42 +07:00
|
|
|
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
|
2007-02-22 05:04:57 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-12-02 18:36:08 +07:00
|
|
|
/*
|
|
|
|
* Compat flags that we support. If any incompat flags are set other than the
|
|
|
|
* ones specified below then we will fail to mount
|
|
|
|
*/
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF (1ULL << 0)
|
2010-06-22 01:48:16 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL (1ULL << 1)
|
2010-09-17 03:19:09 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (1ULL << 2)
|
2010-10-25 14:12:26 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO (1ULL << 3)
|
2010-08-07 00:21:20 +07:00
|
|
|
/*
|
|
|
|
* some patches floated around with a second compression method
|
|
|
|
* let's save that incompat here for when they do get in
|
|
|
|
* Note we don't actually support it, we're just reserving the
|
|
|
|
* number
|
|
|
|
*/
|
|
|
|
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2 (1ULL << 4)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* older kernels tried to do bigger metadata blocks, but the
|
|
|
|
* code was pretty buggy. Let's not let them try anymore.
|
|
|
|
*/
|
|
|
|
#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5)
|
2009-06-10 21:45:14 +07:00
|
|
|
|
2012-08-09 01:32:27 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6)
|
2013-01-30 06:40:14 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
|
2013-03-08 02:22:04 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
|
2013-10-22 23:18:51 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
|
2012-08-09 01:32:27 +07:00
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_FEATURE_COMPAT_SUPP 0ULL
|
2013-11-16 03:33:55 +07:00
|
|
|
#define BTRFS_FEATURE_COMPAT_SAFE_SET 0ULL
|
|
|
|
#define BTRFS_FEATURE_COMPAT_SAFE_CLEAR 0ULL
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL
|
2013-11-16 03:33:55 +07:00
|
|
|
#define BTRFS_FEATURE_COMPAT_RO_SAFE_SET 0ULL
|
|
|
|
#define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
|
|
|
|
|
2010-06-22 01:48:16 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_SUPP \
|
|
|
|
(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \
|
2010-09-17 03:19:09 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL | \
|
2010-10-25 14:12:26 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \
|
2010-08-07 00:21:20 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_BIG_METADATA | \
|
2012-08-09 01:32:27 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \
|
2013-01-30 06:40:14 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_RAID56 | \
|
2013-03-08 02:22:04 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
|
2013-10-22 23:18:51 +07:00
|
|
|
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
|
|
|
|
BTRFS_FEATURE_INCOMPAT_NO_HOLES)
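The comment earlier in this section states that a mount fails if any incompat flag outside the supported set is present. A minimal userspace sketch of that check follows; the `demo_` names are hypothetical, and the supported mask simply mirrors `BTRFS_FEATURE_INCOMPAT_SUPP` as listed above (note that the reserved `COMPRESS_LZOv2` bit is deliberately absent from it).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
#define DEMO_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
#define DEMO_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
#define DEMO_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
#define DEMO_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4) /* reserved, unsupported */
#define DEMO_INCOMPAT_BIG_METADATA	(1ULL << 5)
#define DEMO_INCOMPAT_EXTENDED_IREF	(1ULL << 6)
#define DEMO_INCOMPAT_RAID56		(1ULL << 7)
#define DEMO_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
#define DEMO_INCOMPAT_NO_HOLES		(1ULL << 9)

#define DEMO_INCOMPAT_SUPP			\
	(DEMO_INCOMPAT_MIXED_BACKREF |		\
	 DEMO_INCOMPAT_DEFAULT_SUBVOL |		\
	 DEMO_INCOMPAT_MIXED_GROUPS |		\
	 DEMO_INCOMPAT_BIG_METADATA |		\
	 DEMO_INCOMPAT_COMPRESS_LZO |		\
	 DEMO_INCOMPAT_RAID56 |			\
	 DEMO_INCOMPAT_EXTENDED_IREF |		\
	 DEMO_INCOMPAT_SKINNY_METADATA |	\
	 DEMO_INCOMPAT_NO_HOLES)

/* The reserved LZOv2 bit must not appear in the supported mask. */
static_assert((DEMO_INCOMPAT_SUPP & DEMO_INCOMPAT_COMPRESS_LZOv2) == 0,
	      "LZOv2 is reserved, not supported");

/* Return the incompat bits in 'features' that are not supported. */
static uint64_t demo_unsupported_incompat(uint64_t features)
{
	return features & ~DEMO_INCOMPAT_SUPP;
}

/* Return 1 if a filesystem with these incompat flags may be mounted. */
static int demo_can_mount(uint64_t features)
{
	return demo_unsupported_incompat(features) == 0;
}
```

Masking with the complement of the supported set rejects both the reserved bit and any bit a future format change might introduce.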
|
2008-12-02 18:36:08 +07:00
|
|
|
|
2013-11-16 03:33:55 +07:00
|
|
|
#define BTRFS_FEATURE_INCOMPAT_SAFE_SET \
|
|
|
|
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
|
|
|
|
#define BTRFS_FEATURE_INCOMPAT_SAFE_CLEAR 0ULL
|
2008-12-02 18:36:08 +07:00
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
2007-03-15 23:56:47 +07:00
|
|
|
* A leaf is full of items. offset and size tell us where to find
|
2007-02-26 22:40:21 +07:00
|
|
|
* the item in the leaf (relative to the start of the data area)
|
|
|
|
*/
|
2007-03-13 07:12:07 +07:00
|
|
|
struct btrfs_item {
|
2007-03-13 03:22:34 +07:00
|
|
|
struct btrfs_disk_key key;
|
2007-03-15 01:14:43 +07:00
|
|
|
__le32 offset;
|
2007-10-16 03:14:19 +07:00
|
|
|
__le32 size;
|
2007-02-02 21:18:22 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
|
|
|
* leaves have an item area and a data area:
|
|
|
|
* [item0, item1....itemN] [free space] [dataN...data1, data0]
|
|
|
|
*
|
|
|
|
* The data is separate from the items to get the keys closer together
|
|
|
|
* during searches.
|
|
|
|
*/
|
2007-03-13 21:46:10 +07:00
|
|
|
struct btrfs_leaf {
|
2007-03-12 23:29:44 +07:00
|
|
|
struct btrfs_header header;
|
2007-03-15 01:14:43 +07:00
|
|
|
struct btrfs_item items[];
|
2007-02-02 21:18:22 +07:00
|
|
|
} __attribute__ ((__packed__));
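The leaf layout described above (item headers growing forward, payloads growing backward from the end) can be sketched in user space as follows. This is a simplified illustration under stated assumptions: the `demo_*` names, the 16-byte header, the 256-byte leaf, and the single `uint64_t` key are all hypothetical, and fields are kept in host byte order rather than little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_LEAF_SIZE   256
#define DEMO_HEADER_SIZE 16

/* Stand-in for struct btrfs_item (hypothetical, host byte order). */
struct demo_item {
	uint64_t key;    /* stand-in for struct btrfs_disk_key */
	uint32_t offset; /* payload offset, relative to the start of body[] */
	uint32_t size;   /* payload length in bytes */
};

struct demo_leaf {
	uint8_t header[DEMO_HEADER_SIZE];
	/* [item0, item1, ...] [free space] [..., data1, data0] */
	uint8_t body[DEMO_LEAF_SIZE - DEMO_HEADER_SIZE];
	uint32_t nritems;
};

static void demo_leaf_init(struct demo_leaf *leaf)
{
	memset(leaf, 0, sizeof(*leaf));
}

/* Lowest payload offset in use, or the end of body[] for an empty leaf. */
static uint32_t demo_data_end(const struct demo_leaf *leaf)
{
	struct demo_item last;

	if (leaf->nritems == 0)
		return sizeof(leaf->body);
	memcpy(&last, leaf->body + (leaf->nritems - 1) * sizeof(last),
	       sizeof(last));
	return last.offset;
}

/* Append an item; returns -1 when the two areas would collide. */
static int demo_leaf_append(struct demo_leaf *leaf, uint64_t key,
			    const void *data, uint32_t size)
{
	uint32_t items_end = (leaf->nritems + 1) * sizeof(struct demo_item);
	uint32_t data_end = demo_data_end(leaf);
	struct demo_item item;

	if (size > data_end || data_end - size < items_end)
		return -1;
	item.key = key;
	item.offset = data_end - size;
	item.size = size;
	memcpy(leaf->body + item.offset, data, size);
	memcpy(leaf->body + leaf->nritems * sizeof(item), &item, sizeof(item));
	leaf->nritems++;
	return 0;
}

/* Look up the payload of the item in the given slot. */
static const void *demo_item_data(const struct demo_leaf *leaf,
				  uint32_t slot, uint32_t *size)
{
	struct demo_item item;

	if (slot >= leaf->nritems)
		return NULL;
	memcpy(&item, leaf->body + slot * sizeof(item), sizeof(item));
	*size = item.size;
	return leaf->body + item.offset;
}
```

Because the item headers are packed together at the front, a key search only touches the first few cache lines of the block, which is the reason the comment above gives for separating keys from data.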
|
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
|
|
|
* all non-leaf blocks are nodes, they hold only keys and pointers to
|
|
|
|
* other blocks
|
|
|
|
*/
|
2007-03-15 01:14:43 +07:00
|
|
|
struct btrfs_key_ptr {
|
|
|
|
struct btrfs_disk_key key;
|
|
|
|
__le64 blockptr;
|
2007-12-11 21:25:06 +07:00
|
|
|
__le64 generation;
|
2007-03-15 01:14:43 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-03-13 21:46:10 +07:00
|
|
|
struct btrfs_node {
|
2007-03-12 23:29:44 +07:00
|
|
|
struct btrfs_header header;
|
2007-03-15 01:14:43 +07:00
|
|
|
struct btrfs_key_ptr ptrs[];
|
2007-02-02 21:18:22 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-02-26 22:40:21 +07:00
|
|
|
/*
|
2007-03-13 21:46:10 +07:00
|
|
|
* btrfs_paths remember the path taken from the root down to the leaf.
|
|
|
|
* level 0 is always the leaf, and nodes[1...BTRFS_MAX_LEVEL] will point
|
2007-02-26 22:40:21 +07:00
|
|
|
* to any other levels that are present.
|
|
|
|
*
|
|
|
|
* The slots array records the index of the item or block pointer
|
|
|
|
* used while walking the tree.
|
|
|
|
*/
|
2007-03-13 21:46:10 +07:00
|
|
|
struct btrfs_path {
|
2007-10-16 03:14:19 +07:00
|
|
|
struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
|
2007-03-13 21:46:10 +07:00
|
|
|
int slots[BTRFS_MAX_LEVEL];
|
2008-06-26 03:01:30 +07:00
|
|
|
/* if there is real range locking, this locks field will change */
|
|
|
|
int locks[BTRFS_MAX_LEVEL];
|
2007-08-08 02:52:22 +07:00
|
|
|
int reada;
|
2008-06-26 03:01:30 +07:00
|
|
|
/* keep some upper locks as we walk down */
|
2007-08-08 03:15:09 +07:00
|
|
|
int lowest_level;
|
2008-12-10 21:10:46 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* set by btrfs_split_item, tells search_slot to keep all locks
|
|
|
|
* and to force calls to keep space in the nodes
|
|
|
|
*/
|
2009-03-13 22:00:37 +07:00
|
|
|
unsigned int search_for_split:1;
|
|
|
|
unsigned int keep_locks:1;
|
|
|
|
unsigned int skip_locking:1;
|
|
|
|
unsigned int leave_spinning:1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
unsigned int search_commit_root:1;
|
2014-03-29 04:16:01 +07:00
|
|
|
unsigned int need_commit_sem:1;
|
2014-11-09 15:38:39 +07:00
|
|
|
unsigned int skip_release_on_error:1;
|
2007-02-02 21:18:22 +07:00
|
|
|
};
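To illustrate how the `nodes[]` and `slots[]` arrays of a path are filled during a descent, here is a small user-space sketch. It is not the kernel's `btrfs_search_slot()`: the `demo_*` types are hypothetical, keys are plain integers, blocks are ordinary pointers instead of extent buffers, and the slot rule (last key less than or equal to the target) is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_LEVEL 8
#define DEMO_MAX_KEYS  4

/* Hypothetical in-memory block: level 0 is a leaf, others hold children. */
struct demo_block {
	int level;
	int nritems;
	uint64_t keys[DEMO_MAX_KEYS];               /* sorted ascending */
	struct demo_block *children[DEMO_MAX_KEYS]; /* unused in leaves */
};

struct demo_path {
	struct demo_block *nodes[DEMO_MAX_LEVEL];
	int slots[DEMO_MAX_LEVEL];
};

/* Index of the last key <= target, or 0 if every key is larger. */
static int demo_find_slot(const struct demo_block *b, uint64_t target)
{
	int slot = 0;
	int i;

	for (i = 0; i < b->nritems; i++)
		if (b->keys[i] <= target)
			slot = i;
	return slot;
}

/*
 * Walk from the root down to the leaf, recording the block and slot used
 * at each level. Returns 1 if the leaf slot holds exactly target, else 0.
 */
static int demo_search(struct demo_block *root, uint64_t target,
		       struct demo_path *path)
{
	struct demo_block *b = root;
	int i;

	for (i = 0; i < DEMO_MAX_LEVEL; i++) {
		path->nodes[i] = NULL;
		path->slots[i] = 0;
	}
	while (b) {
		int slot = demo_find_slot(b, target);

		path->nodes[b->level] = b;
		path->slots[b->level] = slot;
		if (b->level == 0)
			return b->nritems > 0 && b->keys[slot] == target;
		b = b->children[slot];
	}
	return 0;
}
```

After a search, `path->nodes[0]` is always the leaf and higher indices hold the ancestors that were visited, matching the description in the comment above.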
|
2007-02-24 18:24:44 +07:00
|
|
|
|
2007-03-15 23:56:47 +07:00
|
|
|
/*
|
|
|
|
* items in the extent btree are used to record the objectid of the
|
|
|
|
* owner of the block and the number of references
|
|
|
|
*/
|
2009-06-10 21:45:14 +07:00
|
|
|
|
2007-03-15 23:56:47 +07:00
|
|
|
struct btrfs_extent_item {
|
2009-06-10 21:45:14 +07:00
|
|
|
__le64 refs;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 flags;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_extent_item_v0 {
|
2007-03-15 23:56:47 +07:00
|
|
|
__le32 refs;
|
2007-12-11 21:25:06 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r) >> 4) - \
|
|
|
|
sizeof(struct btrfs_item))
|
|
|
|
|
|
|
|
#define BTRFS_EXTENT_FLAG_DATA (1ULL << 0)
|
|
|
|
#define BTRFS_EXTENT_FLAG_TREE_BLOCK (1ULL << 1)
|
|
|
|
|
|
|
|
/* following flags only apply to tree blocks */
|
|
|
|
|
|
|
|
/* use full backrefs for extent pointers in the block */
|
|
|
|
#define BTRFS_BLOCK_FLAG_FULL_BACKREF (1ULL << 8)
|
|
|
|
|
2011-03-08 20:14:00 +07:00
|
|
|
/*
|
|
|
|
* this flag is only used internally by scrub and may be changed at any time
|
|
|
|
* it is only declared here to avoid collisions
|
|
|
|
*/
|
|
|
|
#define BTRFS_EXTENT_FLAG_SUPER (1ULL << 48)
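The extent flags above combine a "kind" bit with optional tree-block-only bits. The sketch below shows one way to test them. The three macro values are copied from the definitions above; the `demo_*` helpers are hypothetical, and the rule that exactly one of DATA and TREE_BLOCK is set is an assumption made for this example rather than something this header states.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_EXTENT_FLAG_DATA		(1ULL << 0)
#define BTRFS_EXTENT_FLAG_TREE_BLOCK	(1ULL << 1)
#define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)

/* FULL_BACKREF is only meaningful when the extent is a tree block. */
static int demo_uses_full_backref(uint64_t flags)
{
	return (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) &&
	       (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF);
}

/*
 * Hypothetical sanity check. Assumes an extent is either data or a tree
 * block (never both, never neither) and rejects FULL_BACKREF on data.
 */
static int demo_flags_valid(uint64_t flags)
{
	uint64_t kind = flags & (BTRFS_EXTENT_FLAG_DATA |
				 BTRFS_EXTENT_FLAG_TREE_BLOCK);

	if (kind != BTRFS_EXTENT_FLAG_DATA &&
	    kind != BTRFS_EXTENT_FLAG_TREE_BLOCK)
		return 0;
	if ((flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) &&
	    kind != BTRFS_EXTENT_FLAG_TREE_BLOCK)
		return 0;
	return 1;
}
```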
|
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
struct btrfs_tree_block_info {
|
|
|
|
struct btrfs_disk_key key;
|
|
|
|
u8 level;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_extent_data_ref {
|
|
|
|
__le64 root;
|
|
|
|
__le64 objectid;
|
|
|
|
__le64 offset;
|
|
|
|
__le32 count;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_shared_data_ref {
|
|
|
|
__le32 count;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_extent_inline_ref {
|
|
|
|
u8 type;
|
2009-07-22 20:59:00 +07:00
|
|
|
__le64 offset;
|
2009-06-10 21:45:14 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
/* old style backrefs item */
|
|
|
|
struct btrfs_extent_ref_v0 {
|
2007-12-11 21:25:06 +07:00
|
|
|
__le64 root;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 objectid;
|
2009-06-10 21:45:14 +07:00
|
|
|
__le32 count;
|
2007-03-15 23:56:47 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
/* dev extents record free space on individual devices. The owner
|
|
|
|
* field points back to the chunk allocation mapping tree that allocated
|
2008-04-16 02:41:47 +07:00
|
|
|
* the extent. The chunk tree uuid field is a way to double check the owner
|
2008-03-25 02:01:56 +07:00
|
|
|
*/
|
|
|
|
struct btrfs_dev_extent {
|
2008-04-16 02:41:47 +07:00
|
|
|
__le64 chunk_tree;
|
|
|
|
__le64 chunk_objectid;
|
|
|
|
__le64 chunk_offset;
|
2008-03-25 02:01:56 +07:00
|
|
|
__le64 length;
|
2008-04-16 02:41:47 +07:00
|
|
|
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
|
2008-03-25 02:01:56 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-12-13 02:38:19 +07:00
|
|
|
struct btrfs_inode_ref {
|
2008-07-24 23:12:38 +07:00
|
|
|
__le64 index;
|
2007-12-13 02:38:19 +07:00
|
|
|
__le16 name_len;
|
|
|
|
/* name goes here */
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2012-08-09 01:32:27 +07:00
|
|
|
struct btrfs_inode_extref {
|
|
|
|
__le64 parent_objectid;
|
|
|
|
__le64 index;
|
|
|
|
__le16 name_len;
|
|
|
|
__u8 name[0];
|
|
|
|
/* name goes here */
|
|
|
|
} __attribute__ ((__packed__));
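Both inode reference structures are followed directly by the name bytes ("name goes here"). The sketch below shows how such a variable-length record might be built and sized in user space. It assumes a GCC-compatible compiler for the packed attribute. `struct demo_inode_extref` and the `demo_*` helpers are hypothetical stand-ins; fields are host-endian here, whereas the on-disk fields are little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Host-endian stand-in for struct btrfs_inode_extref. */
struct demo_inode_extref {
	uint64_t parent_objectid;
	uint64_t index;
	uint16_t name_len;
	uint8_t name[]; /* name bytes follow immediately, no NUL terminator */
} __attribute__ ((__packed__));

/* Allocate a record with room for the name; caller frees with free(). */
static struct demo_inode_extref *demo_extref_new(uint64_t parent,
						 uint64_t index,
						 const char *name)
{
	size_t len = strlen(name);
	struct demo_inode_extref *ref;

	if (len > UINT16_MAX)
		return NULL;
	ref = malloc(sizeof(*ref) + len);
	if (!ref)
		return NULL;
	ref->parent_objectid = parent;
	ref->index = index;
	ref->name_len = (uint16_t)len;
	memcpy(ref->name, name, len);
	return ref;
}

/* Total record size: fixed part plus the trailing name. */
static size_t demo_extref_size(const struct demo_inode_extref *ref)
{
	return sizeof(*ref) + ref->name_len;
}
```

With the packed attribute the fixed part is 8 + 8 + 2 = 18 bytes, so a record for an 8-character name occupies 26 bytes. Stepping to the next record in the same item would advance by `demo_extref_size()`.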
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
struct btrfs_timespec {
|
2007-03-30 02:15:27 +07:00
|
|
|
__le64 sec;
|
2007-03-16 06:03:33 +07:00
|
|
|
__le32 nsec;
|
|
|
|
} __attribute__ ((__packed__));
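Because `struct btrfs_timespec` is packed and little-endian, it occupies 12 bytes on disk (an 8-byte `sec` followed by a 4-byte `nsec`). The sketch below decodes such a record from a raw buffer byte by byte, so the result does not depend on host endianness. `struct demo_time` and the `demo_*` helpers are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded, host-endian view of the on-disk timestamp. */
struct demo_time {
	uint64_t sec;
	uint32_t nsec;
};

/* Read a little-endian 64-bit value one byte at a time. */
static uint64_t demo_get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Read a little-endian 32-bit value one byte at a time. */
static uint32_t demo_get_le32(const uint8_t *p)
{
	uint32_t v = 0;
	int i;

	for (i = 3; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode the 12-byte packed layout: sec at offset 0, nsec at offset 8. */
static struct demo_time demo_read_timespec(const uint8_t raw[12])
{
	struct demo_time t;

	t.sec = demo_get_le64(raw);
	t.nsec = demo_get_le32(raw + 8);
	return t;
}
```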
|
|
|
|
|
2009-01-21 22:49:16 +07:00
|
|
|
enum btrfs_compression_type {
|
2010-12-17 13:21:50 +07:00
|
|
|
BTRFS_COMPRESS_NONE = 0,
|
|
|
|
BTRFS_COMPRESS_ZLIB = 1,
|
2010-10-25 14:12:26 +07:00
|
|
|
BTRFS_COMPRESS_LZO = 2,
|
|
|
|
BTRFS_COMPRESS_TYPES = 2,
|
|
|
|
BTRFS_COMPRESS_LAST = 3,
|
2009-01-21 22:49:16 +07:00
|
|
|
};
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 01:49:59 +07:00
|
|
|
|
2007-03-16 06:03:33 +07:00
|
|
|
struct btrfs_inode_item {
|
2008-09-06 03:13:11 +07:00
|
|
|
/* nfs style generation number */
|
2007-03-16 06:03:33 +07:00
|
|
|
__le64 generation;
|
2008-09-06 03:13:11 +07:00
|
|
|
/* transid that last touched this inode */
|
|
|
|
__le64 transid;
|
2007-03-16 06:03:33 +07:00
|
|
|
__le64 size;
|
2008-10-09 22:46:29 +07:00
|
|
|
__le64 nbytes;
|
2007-05-01 02:25:45 +07:00
|
|
|
__le64 block_group;
|
2007-03-16 06:03:33 +07:00
|
|
|
__le32 nlink;
|
|
|
|
__le32 uid;
|
|
|
|
__le32 gid;
|
|
|
|
__le32 mode;
|
2008-03-25 02:01:56 +07:00
|
|
|
__le64 rdev;
|
2008-12-02 18:36:08 +07:00
|
|
|
__le64 flags;
|
2008-10-30 01:49:59 +07:00
|
|
|
|
2008-12-09 04:40:21 +07:00
|
|
|
/* modification sequence number for NFS */
|
|
|
|
__le64 sequence;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* a little future expansion, for more than this we can
|
|
|
|
* just grow the inode item and version it
|
|
|
|
*/
|
|
|
|
__le64 reserved[4];
|
2008-03-25 02:01:56 +07:00
|
|
|
struct btrfs_timespec atime;
|
|
|
|
struct btrfs_timespec ctime;
|
|
|
|
struct btrfs_timespec mtime;
|
|
|
|
struct btrfs_timespec otime;
|
2007-03-16 06:03:33 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-09-06 03:13:11 +07:00
|
|
|
struct btrfs_dir_log_item {
|
|
|
|
__le64 end;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-03-15 23:56:47 +07:00
|
|
|
struct btrfs_dir_item {
|
2007-04-07 02:37:36 +07:00
|
|
|
struct btrfs_disk_key location;
|
2008-09-06 03:13:11 +07:00
|
|
|
__le64 transid;
|
2007-11-16 23:45:54 +07:00
|
|
|
__le16 data_len;
|
2007-03-16 19:46:49 +07:00
|
|
|
__le16 name_len;
|
2007-03-15 23:56:47 +07:00
|
|
|
u8 type;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2010-12-20 15:04:08 +07:00
|
|
|
#define BTRFS_ROOT_SUBVOL_RDONLY (1ULL << 0)
|
|
|
|
|
2014-04-15 21:41:44 +07:00
|
|
|
/*
|
|
|
|
* Internal in-memory flag that a subvolume has been marked for deletion but
|
|
|
|
* is still visible as a directory
|
|
|
|
*/
|
|
|
|
#define BTRFS_ROOT_SUBVOL_DEAD (1ULL << 48)
|
|
|
|
|
2007-03-15 23:56:47 +07:00
|
|
|
struct btrfs_root_item {
|
2007-04-07 02:37:36 +07:00
|
|
|
struct btrfs_inode_item inode;
|
2008-10-30 01:49:05 +07:00
|
|
|
__le64 generation;
|
2007-04-07 02:37:36 +07:00
|
|
|
__le64 root_dirid;
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 bytenr;
|
|
|
|
__le64 byte_limit;
|
|
|
|
__le64 bytes_used;
|
2008-10-31 01:20:02 +07:00
|
|
|
__le64 last_snapshot;
|
2008-12-02 18:36:08 +07:00
|
|
|
__le64 flags;
|
2007-03-15 23:56:47 +07:00
|
|
|
__le32 refs;
|
2007-06-23 01:16:25 +07:00
|
|
|
struct btrfs_disk_key drop_progress;
|
|
|
|
u8 drop_level;
|
2007-10-16 03:15:53 +07:00
|
|
|
u8 level;
|
2012-07-25 22:35:53 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The following fields appear after subvol_uuids+subvol_times
|
|
|
|
* were introduced.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This generation number is used to test if the new fields are valid
|
|
|
|
* and up to date while reading the root item. Every time the root item
|
|
|
|
* is written out, the "generation" field is copied into this field. If
|
|
|
|
* anyone ever mounted the fs with an older kernel, we will have
|
|
|
|
* mismatching generation values here and thus must invalidate the
|
|
|
|
* new fields. See btrfs_update_root and btrfs_find_last_root for
|
|
|
|
* details.
|
|
|
|
* The offset of generation_v2 is also used as the start for the memset
|
|
|
|
* when invalidating the fields.
|
|
|
|
*/
|
|
|
|
__le64 generation_v2;
|
|
|
|
u8 uuid[BTRFS_UUID_SIZE];
|
|
|
|
u8 parent_uuid[BTRFS_UUID_SIZE];
|
|
|
|
u8 received_uuid[BTRFS_UUID_SIZE];
|
|
|
|
__le64 ctransid; /* updated when an inode changes */
|
|
|
|
__le64 otransid; /* trans when created */
|
|
|
|
__le64 stransid; /* trans when sent. non-zero for received subvol */
|
|
|
|
__le64 rtransid; /* trans when received. non-zero for received subvol */
|
|
|
|
struct btrfs_timespec ctime;
|
|
|
|
struct btrfs_timespec otime;
|
|
|
|
struct btrfs_timespec stime;
|
|
|
|
struct btrfs_timespec rtime;
|
|
|
|
__le64 reserved[8]; /* for future */
|
2007-03-21 01:38:32 +07:00
|
|
|
} __attribute__ ((__packed__));
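The generation_v2 comment above describes a validity check followed by a memset that starts at the offset of generation_v2. A minimal sketch of that idea, using a heavily trimmed, hypothetical stand-in structure (`demo_root_item`) in host byte order rather than the real root item:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed stand-in for the root item. */
struct demo_root_item {
	uint64_t generation;
	uint8_t level;
	/* Fields from here on are the "new" ones guarded by generation_v2. */
	uint64_t generation_v2;
	uint8_t uuid[16];
	uint64_t ctransid;
} __attribute__ ((__packed__));

/*
 * Returns 1 if the new fields were stale and had to be cleared,
 * 0 if generation_v2 matched and they can be trusted.
 */
static int demo_validate_root_item(struct demo_root_item *item)
{
	size_t start = offsetof(struct demo_root_item, generation_v2);

	if (item->generation_v2 == item->generation)
		return 0;

	/* Zero everything from generation_v2 to the end of the structure. */
	memset((uint8_t *)item + start, 0, sizeof(*item) - start);
	return 1;
}
```

A mismatch indicates that some writer updated `generation` without knowing about the newer fields, so those fields cannot be trusted.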
|
2007-03-15 23:56:47 +07:00
|
|
|
|
2008-11-18 08:37:39 +07:00
|
|
|
/*
|
|
|
|
* this is used for both forward and backward root refs
|
|
|
|
*/
|
|
|
|
struct btrfs_root_ref {
|
|
|
|
__le64 dirid;
|
|
|
|
__le64 sequence;
|
|
|
|
__le16 name_len;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2012-01-17 03:04:48 +07:00
|
|
|
struct btrfs_disk_balance_args {
|
|
|
|
/*
|
|
|
|
* profiles to operate on, single is denoted by
|
|
|
|
* BTRFS_AVAIL_ALLOC_BIT_SINGLE
|
|
|
|
*/
|
|
|
|
__le64 profiles;
|
|
|
|
|
2015-10-20 23:22:13 +07:00
|
|
|
/*
|
|
|
|
* usage filter
|
|
|
|
* BTRFS_BALANCE_ARGS_USAGE with a single value means '0..N'
|
|
|
|
* BTRFS_BALANCE_ARGS_USAGE_RANGE - range syntax, min..max
|
|
|
|
*/
|
|
|
|
union {
|
|
|
|
__le64 usage;
|
|
|
|
struct {
|
|
|
|
__le32 usage_min;
|
|
|
|
__le32 usage_max;
|
|
|
|
};
|
|
|
|
};
|
2012-01-17 03:04:48 +07:00
|
|
|
|
|
|
|
/* devid filter */
|
|
|
|
__le64 devid;
|
|
|
|
|
|
|
|
/* devid subset filter [pstart..pend) */
|
|
|
|
__le64 pstart;
|
|
|
|
__le64 pend;
|
|
|
|
|
|
|
|
/* btrfs virtual address space subset filter [vstart..vend) */
|
|
|
|
__le64 vstart;
|
|
|
|
__le64 vend;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* profile to convert to, single is denoted by
|
|
|
|
* BTRFS_AVAIL_ALLOC_BIT_SINGLE
|
|
|
|
*/
|
|
|
|
__le64 target;
|
|
|
|
|
|
|
|
/* BTRFS_BALANCE_ARGS_* */
|
|
|
|
__le64 flags;
|
|
|
|
|
2015-10-10 22:16:50 +07:00
|
|
|
/*
|
|
|
|
* BTRFS_BALANCE_ARGS_LIMIT with value 'limit'
|
|
|
|
* BTRFS_BALANCE_ARGS_LIMIT_RANGE - the extended version can use minimum
|
|
|
|
* and maximum
|
|
|
|
*/
|
|
|
|
union {
|
|
|
|
__le64 limit;
|
|
|
|
struct {
|
|
|
|
__le32 limit_min;
|
|
|
|
__le32 limit_max;
|
|
|
|
};
|
|
|
|
};
|
2014-05-07 22:37:51 +07:00
|
|
|
|
2015-09-29 05:32:41 +07:00
|
|
|
/*
|
|
|
|
* Process chunks that cross stripes_min..stripes_max devices,
|
|
|
|
* BTRFS_BALANCE_ARGS_STRIPES_RANGE
|
|
|
|
*/
|
|
|
|
__le32 stripes_min;
|
|
|
|
__le32 stripes_max;
|
|
|
|
|
|
|
|
__le64 unused[6];
|
2012-01-17 03:04:48 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* store balance parameters to disk so that balance can be properly
|
|
|
|
* resumed after crash or unmount
|
|
|
|
*/
|
|
|
|
struct btrfs_balance_item {
|
|
|
|
/* BTRFS_BALANCE_* */
|
|
|
|
__le64 flags;
|
|
|
|
|
|
|
|
struct btrfs_disk_balance_args data;
|
|
|
|
struct btrfs_disk_balance_args meta;
|
|
|
|
struct btrfs_disk_balance_args sys;
|
|
|
|
|
|
|
|
__le64 unused[4];
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-10-31 01:25:28 +07:00
|
|
|
#define BTRFS_FILE_EXTENT_INLINE 0
|
|
|
|
#define BTRFS_FILE_EXTENT_REG 1
|
|
|
|
#define BTRFS_FILE_EXTENT_PREALLOC 2
|
2007-04-20 00:37:44 +07:00
|
|
|
|
2007-03-21 01:38:32 +07:00
|
|
|
struct btrfs_file_extent_item {
|
2008-10-30 01:49:59 +07:00
|
|
|
/*
|
|
|
|
* transaction id that created this extent
|
|
|
|
*/
|
2007-03-27 20:16:29 +07:00
|
|
|
__le64 generation;
|
2008-10-30 01:49:59 +07:00
|
|
|
/*
|
|
|
|
* max number of bytes to hold this extent in ram
|
|
|
|
* when we split a compressed extent we can't know how big
|
|
|
|
* each of the resulting pieces will be. So, this is
|
|
|
|
* an upper limit on the size of the extent in ram instead of
|
|
|
|
* an exact limit.
|
|
|
|
*/
|
|
|
|
__le64 ram_bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 32 bits for the various ways we might encode the data,
|
|
|
|
* including compression and encryption. If any of these
|
|
|
|
* are set to something a given disk format doesn't understand
|
|
|
|
* it is treated like an incompat flag for reading and writing,
|
|
|
|
* but not for stat.
|
|
|
|
*/
|
|
|
|
u8 compression;
|
|
|
|
u8 encryption;
|
|
|
|
__le16 other_encoding; /* spare for later use */
|
|
|
|
|
|
|
|
/* are we inline data or a real extent? */
|
2007-04-20 00:37:44 +07:00
|
|
|
u8 type;
|
2008-10-30 01:49:59 +07:00
|
|
|
|
2007-03-21 01:38:32 +07:00
|
|
|
/*
|
|
|
|
* disk space consumed by the extent, checksum blocks are included
|
|
|
|
* in these numbers
|
2014-07-24 22:34:58 +07:00
|
|
|
*
|
|
|
|
* At this offset in the structure, the inline extent data start.
|
2007-03-21 01:38:32 +07:00
|
|
|
*/
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 disk_bytenr;
|
|
|
|
__le64 disk_num_bytes;
|
2007-03-21 01:38:32 +07:00
|
|
|
/*
|
2007-03-27 03:00:06 +07:00
|
|
|
* the logical offset in file blocks (no csums)
|
2007-03-21 01:38:32 +07:00
|
|
|
* this extent record is for. This allows a file extent to point
|
|
|
|
* into the middle of an existing extent on disk, sharing it
|
|
|
|
* between two snapshots (useful if some bytes in the middle of the
|
|
|
|
* extent have changed)
|
|
|
|
*/
|
|
|
|
__le64 offset;
|
|
|
|
/*
|
2008-10-30 01:49:59 +07:00
|
|
|
* the logical number of file blocks (no csums included). This
|
|
|
|
* always reflects the size uncompressed and without encoding.
|
2007-03-21 01:38:32 +07:00
|
|
|
*/
|
2007-10-16 03:15:53 +07:00
|
|
|
__le64 num_bytes;
|
2008-10-30 01:49:59 +07:00
|
|
|
|
2007-03-21 01:38:32 +07:00
|
|
|
} __attribute__ ((__packed__));
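The comment before disk_bytenr notes that inline extent data start at that offset in the structure. The sketch below mirrors the field layout in a hypothetical userspace struct (`demo_file_extent_item`, fixed-width host types instead of `__le64`) to show how that offset and the inline payload length can be derived. It assumes a compiler that supports the GCC-style packed attribute.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the packed field layout. */
struct demo_file_extent_item {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
	uint8_t type;
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
} __attribute__ ((__packed__));

/* Inline data begin where disk_bytenr would otherwise be stored. */
#define DEMO_INLINE_DATA_START \
	offsetof(struct demo_file_extent_item, disk_bytenr)

/* Number of inline payload bytes in an item of item_size bytes. */
static size_t demo_inline_data_len(size_t item_size)
{
	if (item_size <= DEMO_INLINE_DATA_START)
		return 0;
	return item_size - DEMO_INLINE_DATA_START;
}
```

With the packed layout the fixed part before the inline payload is 8 + 8 + 1 + 1 + 2 + 1 = 21 bytes.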
|
|
|
|
|
2007-03-30 02:15:27 +07:00
|
|
|
struct btrfs_csum_item {
|
2007-05-10 23:36:17 +07:00
|
|
|
u8 csum;
|
2007-03-30 02:15:27 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2012-05-25 21:06:10 +07:00
|
|
|
struct btrfs_dev_stats_item {
|
|
|
|
/*
|
|
|
|
* grow this item struct at the end for future enhancements and keep
|
|
|
|
* the existing values unchanged
|
|
|
|
*/
|
|
|
|
__le64 values[BTRFS_DEV_STAT_VALUES_MAX];
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2012-11-05 23:26:40 +07:00
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_ALWAYS 0
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID 1
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED 0
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_STATE_STARTED 1
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED 3
|
|
|
|
#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED 4
|
|
|
|
|
|
|
|
struct btrfs_dev_replace {
|
|
|
|
u64 replace_state; /* see #define above */
|
|
|
|
u64 time_started; /* seconds since 1-Jan-1970 */
|
|
|
|
u64 time_stopped; /* seconds since 1-Jan-1970 */
|
|
|
|
atomic64_t num_write_errors;
|
|
|
|
atomic64_t num_uncorrectable_read_errors;
|
|
|
|
|
|
|
|
u64 cursor_left;
|
|
|
|
u64 committed_cursor_left;
|
|
|
|
u64 cursor_left_last_write_of_item;
|
|
|
|
u64 cursor_right;
|
|
|
|
|
|
|
|
u64 cont_reading_from_srcdev_mode; /* see #define above */
|
|
|
|
|
|
|
|
int is_valid;
|
|
|
|
int item_needs_writeback;
|
|
|
|
struct btrfs_device *srcdev;
|
|
|
|
struct btrfs_device *tgtdev;
|
|
|
|
|
|
|
|
pid_t lock_owner;
|
|
|
|
atomic_t nesting_level;
|
|
|
|
struct mutex lock_finishing_cancel_unmount;
|
|
|
|
struct mutex lock_management_lock;
|
|
|
|
struct mutex lock;
|
|
|
|
|
|
|
|
struct btrfs_scrub_progress scrub_progress;
|
|
|
|
};
|
|
|
|
|
2012-11-05 23:32:20 +07:00
|
|
|
struct btrfs_dev_replace_item {
|
|
|
|
/*
|
|
|
|
* grow this item struct at the end for future enhancements and keep
|
|
|
|
* the existing values unchanged
|
|
|
|
*/
|
|
|
|
__le64 src_devid;
|
|
|
|
__le64 cursor_left;
|
|
|
|
__le64 cursor_right;
|
|
|
|
__le64 cont_reading_from_srcdev_mode;
|
|
|
|
|
|
|
|
__le64 replace_state;
|
|
|
|
__le64 time_started;
|
|
|
|
__le64 time_stopped;
|
|
|
|
__le64 num_write_errors;
|
|
|
|
__le64 num_uncorrectable_read_errors;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
/* different types of block groups (and chunks) */
|
2012-01-17 03:04:47 +07:00
|
|
|
#define BTRFS_BLOCK_GROUP_DATA (1ULL << 0)
|
|
|
|
#define BTRFS_BLOCK_GROUP_SYSTEM (1ULL << 1)
|
|
|
|
#define BTRFS_BLOCK_GROUP_METADATA (1ULL << 2)
|
|
|
|
#define BTRFS_BLOCK_GROUP_RAID0 (1ULL << 3)
|
|
|
|
#define BTRFS_BLOCK_GROUP_RAID1 (1ULL << 4)
|
|
|
|
#define BTRFS_BLOCK_GROUP_DUP (1ULL << 5)
|
|
|
|
#define BTRFS_BLOCK_GROUP_RAID10 (1ULL << 6)
|
2013-05-11 18:12:54 +07:00
|
|
|
#define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7)
|
|
|
|
#define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
|
2014-02-07 20:34:12 +07:00
|
|
|
#define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
|
|
|
|
BTRFS_SPACE_INFO_GLOBAL_RSV)
|
2013-01-17 12:38:51 +07:00
|
|
|
|
|
|
|
enum btrfs_raid_types {
|
|
|
|
BTRFS_RAID_RAID10,
|
|
|
|
BTRFS_RAID_RAID1,
|
|
|
|
BTRFS_RAID_DUP,
|
|
|
|
BTRFS_RAID_RAID0,
|
|
|
|
BTRFS_RAID_SINGLE,
|
2013-02-21 02:06:05 +07:00
|
|
|
BTRFS_RAID_RAID5,
|
|
|
|
BTRFS_RAID_RAID6,
|
2013-01-17 12:38:51 +07:00
|
|
|
BTRFS_NR_RAID_TYPES
|
|
|
|
};
|
2012-01-17 03:04:47 +07:00
|
|
|
|
|
|
|
#define BTRFS_BLOCK_GROUP_TYPE_MASK (BTRFS_BLOCK_GROUP_DATA | \
|
|
|
|
BTRFS_BLOCK_GROUP_SYSTEM | \
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
|
|
|
|
#define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 | \
|
2013-01-30 06:40:14 +07:00
|
|
|
BTRFS_BLOCK_GROUP_RAID5 | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6 | \
|
2012-01-17 03:04:47 +07:00
|
|
|
BTRFS_BLOCK_GROUP_DUP | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10)
|
2015-01-20 14:11:44 +07:00
|
|
|
#define BTRFS_BLOCK_GROUP_RAID56_MASK (BTRFS_BLOCK_GROUP_RAID5 | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6)
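The masks above split a block group flags word into a type part and a profile part. A small userspace sketch of that split follows; the `DEMO_*` macros copy the bit positions listed above, and the helper names are hypothetical.

```c
#include <stdint.h>

#define DEMO_BG_DATA		(1ULL << 0)
#define DEMO_BG_SYSTEM		(1ULL << 1)
#define DEMO_BG_METADATA	(1ULL << 2)
#define DEMO_BG_RAID0		(1ULL << 3)
#define DEMO_BG_RAID1		(1ULL << 4)
#define DEMO_BG_DUP		(1ULL << 5)
#define DEMO_BG_RAID10		(1ULL << 6)
#define DEMO_BG_RAID5		(1ULL << 7)
#define DEMO_BG_RAID6		(1ULL << 8)

#define DEMO_BG_TYPE_MASK	(DEMO_BG_DATA | DEMO_BG_SYSTEM | \
				 DEMO_BG_METADATA)
#define DEMO_BG_PROFILE_MASK	(DEMO_BG_RAID0 | DEMO_BG_RAID1 | \
				 DEMO_BG_RAID5 | DEMO_BG_RAID6 | \
				 DEMO_BG_DUP | DEMO_BG_RAID10)
#define DEMO_BG_RAID56_MASK	(DEMO_BG_RAID5 | DEMO_BG_RAID6)

/* What the block group stores: data, system, metadata (or a mix). */
static uint64_t demo_bg_type(uint64_t flags)
{
	return flags & DEMO_BG_TYPE_MASK;
}

/* How it is replicated; 0 means no profile bit is set. */
static uint64_t demo_bg_profile(uint64_t flags)
{
	return flags & DEMO_BG_PROFILE_MASK;
}

/* True for the parity-based profiles. */
static int demo_bg_is_raid56(uint64_t flags)
{
	return (flags & DEMO_BG_RAID56_MASK) != 0;
}
```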
|
|
|
|
|
2012-01-17 03:04:47 +07:00
|
|
|
/*
|
|
|
|
* We need a bit for restriper to be able to tell when chunks of type
|
|
|
|
* SINGLE are available. This "extended" profile format is used in
|
|
|
|
* fs_info->avail_*_alloc_bits (in-memory) and balance item fields
|
|
|
|
* (on-disk). The corresponding on-disk bit in chunk.type is reserved
|
|
|
|
* to avoid remappings between two formats in future.
|
|
|
|
*/
|
|
|
|
#define BTRFS_AVAIL_ALLOC_BIT_SINGLE (1ULL << 48)
|
|
|
|
|
2014-02-07 20:34:12 +07:00
|
|
|
/*
|
|
|
|
* A fake block group type that is used to communicate global block reserve
|
|
|
|
* size to userspace via the SPACE_INFO ioctl.
|
|
|
|
*/
|
|
|
|
#define BTRFS_SPACE_INFO_GLOBAL_RSV (1ULL << 49)
|
|
|
|
|
2012-03-27 21:09:16 +07:00
|
|
|
#define BTRFS_EXTENDED_PROFILE_MASK (BTRFS_BLOCK_GROUP_PROFILE_MASK | \
|
|
|
|
BTRFS_AVAIL_ALLOC_BIT_SINGLE)
|
|
|
|
|
|
|
|
static inline u64 chunk_to_extended(u64 flags)
|
|
|
|
{
|
|
|
|
if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0)
|
|
|
|
flags |= BTRFS_AVAIL_ALLOC_BIT_SINGLE;
|
|
|
|
|
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
static inline u64 extended_to_chunk(u64 flags)
|
|
|
|
{
|
|
|
|
return flags & ~BTRFS_AVAIL_ALLOC_BIT_SINGLE;
|
|
|
|
}
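The two helpers above convert between the on-disk chunk format, where "single" is the absence of any profile bit, and the extended format, where "single" has its own bit. A self-contained userspace copy for experimentation is sketched below; the `demo_` names are hypothetical, and the bit values are taken from the definitions above.

```c
#include <stdint.h>

#define DEMO_BG_DATA			(1ULL << 0)
#define DEMO_BG_RAID1			(1ULL << 4)
/* RAID0, RAID1, DUP, RAID10, RAID5, RAID6 occupy bits 3..8. */
#define DEMO_BG_PROFILE_MASK		(0x3fULL << 3)
#define DEMO_AVAIL_ALLOC_BIT_SINGLE	(1ULL << 48)

/* No profile bit set means "single": make that explicit. */
static uint64_t demo_chunk_to_extended(uint64_t flags)
{
	if ((flags & DEMO_BG_PROFILE_MASK) == 0)
		flags |= DEMO_AVAIL_ALLOC_BIT_SINGLE;

	return flags;
}

/* Drop the explicit "single" bit again before writing chunk flags. */
static uint64_t demo_extended_to_chunk(uint64_t flags)
{
	return flags & ~DEMO_AVAIL_ALLOC_BIT_SINGLE;
}
```

For any flags word without the extended bit, `demo_extended_to_chunk(demo_chunk_to_extended(flags))` returns the original value.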
|
|
|
|
|
2007-04-27 03:46:15 +07:00
|
|
|
struct btrfs_block_group_item {
|
|
|
|
__le64 used;
|
2008-03-25 02:01:56 +07:00
|
|
|
__le64 chunk_objectid;
|
|
|
|
__le64 flags;
|
2007-04-27 03:46:15 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2015-02-27 15:24:22 +07:00
|
|
|
#define BTRFS_QGROUP_LEVEL_SHIFT 48
|
|
|
|
static inline u64 btrfs_qgroup_level(u64 qgroupid)
|
|
|
|
{
|
|
|
|
return qgroupid >> BTRFS_QGROUP_LEVEL_SHIFT;
|
|
|
|
}
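btrfs_qgroup_level() reads the level from the bits above BTRFS_QGROUP_LEVEL_SHIFT. A sketch of the inverse direction is shown below, assuming the remaining low 48 bits carry the id within that level; `demo_make_qgroupid` and `demo_qgroup_id_in_level` are hypothetical helpers, not part of this header.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QGROUP_LEVEL_SHIFT 48

/* Same computation as btrfs_qgroup_level() above. */
static uint64_t demo_qgroup_level(uint64_t qgroupid)
{
	return qgroupid >> DEMO_QGROUP_LEVEL_SHIFT;
}

/* Low 48 bits: the id within the level (assumption, see lead-in). */
static uint64_t demo_qgroup_id_in_level(uint64_t qgroupid)
{
	return qgroupid & ((1ULL << DEMO_QGROUP_LEVEL_SHIFT) - 1);
}

/* Combine a level and an id into one 64-bit qgroupid. */
static uint64_t demo_make_qgroupid(uint16_t level, uint64_t id)
{
	assert(id < (1ULL << DEMO_QGROUP_LEVEL_SHIFT));
	return ((uint64_t)level << DEMO_QGROUP_LEVEL_SHIFT) | id;
}
```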
|
|
|
|
|
2011-09-13 16:06:07 +07:00
|
|
|
/*
|
|
|
|
* is subvolume quota turned on?
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_STATUS_FLAG_ON (1ULL << 0)
|
|
|
|
/*
|
2013-04-25 23:04:51 +07:00
|
|
|
* RESCAN is set during the initialization phase
|
2011-09-13 16:06:07 +07:00
|
|
|
*/
|
2013-04-25 23:04:51 +07:00
|
|
|
#define BTRFS_QGROUP_STATUS_FLAG_RESCAN (1ULL << 1)
|
2011-09-13 16:06:07 +07:00
|
|
|
/*
|
|
|
|
* Some qgroup entries are known to be out of date,
|
|
|
|
* either because the configuration has changed in a way that
|
|
|
|
* makes a rescan necessary, or because the fs has been mounted
|
|
|
|
* with a non-qgroup-aware version.
|
|
|
|
* Turning quota off and on again makes it inconsistent, too.
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT (1ULL << 2)
|
|
|
|
|
|
|
|
#define BTRFS_QGROUP_STATUS_VERSION 1
|
|
|
|
|
|
|
|
struct btrfs_qgroup_status_item {
|
|
|
|
__le64 version;
|
|
|
|
/*
|
|
|
|
* the generation is updated during every commit. As older
|
|
|
|
* versions of btrfs are not aware of qgroups, it will be
|
|
|
|
* possible to detect inconsistencies by checking the
|
|
|
|
* generation at mount time
|
|
|
|
*/
|
|
|
|
__le64 generation;
|
|
|
|
|
|
|
|
/* see the flag definitions above */
|
|
|
|
__le64 flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* only used during scanning to record the progress
|
|
|
|
* of the scan. It contains a logical address
|
|
|
|
*/
|
2013-04-25 23:04:51 +07:00
|
|
|
__le64 rescan;
|
2011-09-13 16:06:07 +07:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_qgroup_info_item {
|
|
|
|
__le64 generation;
|
|
|
|
__le64 rfer;
|
|
|
|
__le64 rfer_cmpr;
|
|
|
|
__le64 excl;
|
|
|
|
__le64 excl_cmpr;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
/* flags definition for qgroup limits */
|
|
|
|
#define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0)
|
|
|
|
#define BTRFS_QGROUP_LIMIT_MAX_EXCL (1ULL << 1)
|
|
|
|
#define BTRFS_QGROUP_LIMIT_RSV_RFER (1ULL << 2)
|
|
|
|
#define BTRFS_QGROUP_LIMIT_RSV_EXCL (1ULL << 3)
|
|
|
|
#define BTRFS_QGROUP_LIMIT_RFER_CMPR (1ULL << 4)
|
|
|
|
#define BTRFS_QGROUP_LIMIT_EXCL_CMPR (1ULL << 5)
|
|
|
|
|
|
|
|
struct btrfs_qgroup_limit_item {
|
|
|
|
/*
|
|
|
|
* only updated when any of the other values change
|
|
|
|
*/
|
|
|
|
__le64 flags;
|
|
|
|
__le64 max_rfer;
|
|
|
|
__le64 max_excl;
|
|
|
|
__le64 rsv_rfer;
|
|
|
|
__le64 rsv_excl;
|
|
|
|
} __attribute__ ((__packed__));
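/*
 * Illustration only, not part of the original header: a sketch of how the
 * BTRFS_QGROUP_LIMIT_* flags can gate which limit fields are meaningful.
 * struct example_qgroup_limit is a hypothetical CPU-endian copy of the
 * on-disk item, and example_exceeds_limit() is a hypothetical helper; the
 * kernel's actual enforcement path may differ.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

#define BTRFS_QGROUP_LIMIT_MAX_RFER	(1ULL << 0)
#define BTRFS_QGROUP_LIMIT_MAX_EXCL	(1ULL << 1)

/* Hypothetical in-memory copy, already converted from little-endian. */
struct example_qgroup_limit {
	u64 flags;
	u64 max_rfer;
	u64 max_excl;
};

/* A limit is only checked when its flag bit is set. */
static inline bool example_exceeds_limit(const struct example_qgroup_limit *lim,
					 u64 rfer, u64 excl)
{
	if ((lim->flags & BTRFS_QGROUP_LIMIT_MAX_RFER) && rfer > lim->max_rfer)
		return true;
	if ((lim->flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) && excl > lim->max_excl)
		return true;
	return false;
}

/* Hypothetical self-check. */
static inline void example_limit_check(void)
{
	struct example_qgroup_limit lim = {
		.flags = BTRFS_QGROUP_LIMIT_MAX_RFER,
		.max_rfer = 1000,
		.max_excl = 10,
	};

	assert(!example_exceeds_limit(&lim, 1000, 0));
	assert(example_exceeds_limit(&lim, 1001, 0));
	/* max_excl is ignored because its flag bit is clear. */
	assert(!example_exceeds_limit(&lim, 0, 500));
}
```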
|
|
|
|
|
2014-05-27 23:59:57 +07:00
|
|
|
/* For raid type sysfs entries */
|
|
|
|
struct raid_kobject {
|
|
|
|
int raid_type;
|
|
|
|
struct kobject kobj;
|
|
|
|
};
|
|
|
|
|
2008-03-25 02:01:59 +07:00
|
|
|
struct btrfs_space_info {
|
2014-01-15 19:00:54 +07:00
|
|
|
spinlock_t lock;
|
2009-02-20 23:00:09 +07:00
|
|
|
|
2010-10-15 01:52:27 +07:00
|
|
|
u64 total_bytes; /* total bytes in the space,
|
|
|
|
this doesn't take mirrors into account */
|
2010-05-16 21:46:24 +07:00
|
|
|
u64 bytes_used; /* total bytes used,
|
2011-04-27 13:28:26 +07:00
|
|
|
this doesn't take mirrors into account */
|
2009-02-20 23:00:09 +07:00
|
|
|
u64 bytes_pinned; /* total bytes pinned, will be freed when the
|
|
|
|
transaction finishes */
|
|
|
|
u64 bytes_reserved; /* total bytes the allocator has reserved for
|
|
|
|
current allocations */
|
|
|
|
u64 bytes_may_use; /* number of bytes that may be used for
|
2009-09-12 03:12:44 +07:00
|
|
|
delalloc/allocations */
|
2014-01-15 19:00:54 +07:00
|
|
|
u64 bytes_readonly; /* total bytes that are read only */
|
|
|
|
|
2015-09-29 22:40:47 +07:00
|
|
|
u64 max_extent_size; /* This will hold the maximum extent size of
|
|
|
|
the space info if we had an ENOSPC in the
|
|
|
|
allocator. */
|
|
|
|
|
2014-01-15 19:00:54 +07:00
|
|
|
unsigned int full:1; /* indicates that we cannot allocate any more
|
|
|
|
chunks for this space */
|
|
|
|
unsigned int chunk_alloc:1; /* set if we are allocating a chunk */
|
|
|
|
|
|
|
|
unsigned int flush:1; /* set if we are trying to make space */
|
|
|
|
|
|
|
|
unsigned int force_alloc; /* set if we need to force a chunk
|
|
|
|
alloc for this space */
|
|
|
|
|
2010-05-16 21:46:24 +07:00
|
|
|
u64 disk_used; /* total bytes used on disk */
|
2010-10-15 01:52:27 +07:00
|
|
|
u64 disk_total; /* total bytes on disk, takes mirrors into
|
|
|
|
account */
|
2009-02-20 23:00:09 +07:00
|
|
|
|
2014-01-15 19:00:54 +07:00
|
|
|
u64 flags;
|
|
|
|
|
2013-06-20 02:00:04 +07:00
|
|
|
/*
|
|
|
|
* bytes_pinned is kept in line with what is actually pinned, as in
|
|
|
|
* we've called update_block_group and dropped the bytes_used counter
|
|
|
|
* and increased the bytes_pinned counter. However this means that
|
|
|
|
* bytes_pinned does not reflect the bytes that will be pinned once the
|
|
|
|
* delayed refs are flushed, so this counter is incremented every time we call
|
|
|
|
* btrfs_free_extent so it is a realtime count of what will be freed
|
|
|
|
* once the transaction is committed. It will be zeroed every time the
|
|
|
|
* transaction commits.
|
|
|
|
*/
|
|
|
|
struct percpu_counter total_bytes_pinned;
|
|
|
|
|
2008-03-25 02:01:59 +07:00
|
|
|
struct list_head list;
|
2015-01-16 20:24:40 +07:00
|
|
|
/* Protected by the spinlock 'lock'. */
|
2014-10-31 20:49:34 +07:00
|
|
|
struct list_head ro_bgs;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 00:14:11 +07:00
|
|
|
|
2014-01-15 19:00:54 +07:00
|
|
|
struct rw_semaphore groups_sem;
|
2008-09-24 00:14:11 +07:00
|
|
|
/* for block groups of our same type */
|
2010-05-16 21:46:24 +07:00
|
|
|
struct list_head block_groups[BTRFS_NR_RAID_TYPES];
|
2011-06-08 03:07:44 +07:00
|
|
|
wait_queue_head_t wait;
|
2013-11-02 00:07:04 +07:00
|
|
|
|
|
|
|
struct kobject kobj;
|
2014-05-27 23:59:57 +07:00
|
|
|
struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
|
2008-09-24 00:14:11 +07:00
|
|
|
};
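/*
 * Illustration only, not part of the original header: a sketch of how the
 * byte counters documented in struct btrfs_space_info can be combined into a
 * "bytes accounted for" total and a remaining-space estimate.
 * struct example_space_info mirrors only the counters needed here, and both
 * helpers are hypothetical names. The real allocator takes the spinlock
 * 'lock' around such reads; locking is omitted in this sketch.
 */

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical subset of the counters in struct btrfs_space_info. */
struct example_space_info {
	u64 total_bytes;
	u64 bytes_used;
	u64 bytes_pinned;
	u64 bytes_reserved;
	u64 bytes_may_use;
	u64 bytes_readonly;
};

/* Everything that is used, held back, or promised to someone. */
static inline u64 example_space_info_used(const struct example_space_info *s)
{
	return s->bytes_used + s->bytes_pinned + s->bytes_reserved +
	       s->bytes_may_use + s->bytes_readonly;
}

/* Remaining bytes, clamped at zero when the space is overcommitted. */
static inline u64 example_space_info_free(const struct example_space_info *s)
{
	u64 used = example_space_info_used(s);

	return used >= s->total_bytes ? 0 : s->total_bytes - used;
}

/* Hypothetical self-check. */
static inline void example_space_info_check(void)
{
	struct example_space_info s = {
		.total_bytes = 1000,
		.bytes_used = 400,
		.bytes_pinned = 50,
		.bytes_reserved = 100,
		.bytes_may_use = 150,
		.bytes_readonly = 0,
	};

	assert(example_space_info_used(&s) == 700);
	assert(example_space_info_free(&s) == 300);
}
```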
|
|
|
|
|
2012-09-06 17:02:28 +07:00
|
|
|
#define BTRFS_BLOCK_RSV_GLOBAL 1
|
|
|
|
#define BTRFS_BLOCK_RSV_DELALLOC 2
|
|
|
|
#define BTRFS_BLOCK_RSV_TRANS 3
|
|
|
|
#define BTRFS_BLOCK_RSV_CHUNK 4
|
|
|
|
#define BTRFS_BLOCK_RSV_DELOPS 5
|
|
|
|
#define BTRFS_BLOCK_RSV_EMPTY 6
|
|
|
|
#define BTRFS_BLOCK_RSV_TEMP 7
|
|
|
|
|
2010-05-16 21:46:25 +07:00
|
|
|
struct btrfs_block_rsv {
|
|
|
|
u64 size;
|
|
|
|
u64 reserved;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
spinlock_t lock;
|
2012-09-06 17:02:28 +07:00
|
|
|
unsigned short full;
|
|
|
|
unsigned short type;
|
|
|
|
unsigned short failfast;
|
2010-05-16 21:46:25 +07:00
|
|
|
};
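/*
 * Illustration only, not part of the original header: a sketch of how the
 * size, reserved and full fields of a block reservation can relate to each
 * other. struct example_block_rsv keeps only those three fields, and
 * example_rsv_add_bytes() is a hypothetical helper; the spinlock that
 * protects the real structure is omitted, and the kernel's own accounting
 * may differ in detail.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical subset of struct btrfs_block_rsv. */
struct example_block_rsv {
	u64 size;		/* how much the reservation wants to hold */
	u64 reserved;		/* how much it currently holds */
	unsigned short full;	/* set once reserved has reached size */
};

/*
 * Add bytes to the reservation. With update_size the target grows along
 * with the reserved amount; otherwise the reservation is marked full as
 * soon as reserved reaches size.
 */
static inline void example_rsv_add_bytes(struct example_block_rsv *rsv,
					 u64 num_bytes, bool update_size)
{
	rsv->reserved += num_bytes;
	if (update_size)
		rsv->size += num_bytes;
	else if (rsv->reserved >= rsv->size)
		rsv->full = 1;
}

/* Hypothetical self-check. */
static inline void example_rsv_check(void)
{
	struct example_block_rsv rsv = { .size = 100, .reserved = 0, .full = 0 };

	example_rsv_add_bytes(&rsv, 60, false);
	assert(rsv.reserved == 60 && rsv.full == 0);

	example_rsv_add_bytes(&rsv, 40, false);
	assert(rsv.reserved == 100 && rsv.full == 1);
}
```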
|
|
|
|
|
2009-04-03 20:47:43 +07:00
|
|
|
/*
|
|
|
|
* free clusters are used to claim free space in relatively large chunks,
|
|
|
|
* allowing us to do less seeky writes. They are used for all metadata
|
|
|
|
* allocations and data allocations in ssd mode.
|
|
|
|
*/
|
|
|
|
struct btrfs_free_cluster {
|
|
|
|
spinlock_t lock;
|
|
|
|
spinlock_t refill_lock;
|
|
|
|
struct rb_root root;
|
|
|
|
|
|
|
|
/* largest extent in this cluster */
|
|
|
|
u64 max_size;
|
|
|
|
|
|
|
|
/* first extent starting offset */
|
|
|
|
u64 window_start;
|
|
|
|
|
2015-10-03 02:25:10 +07:00
|
|
|
/* We did a full search and couldn't create a cluster */
|
|
|
|
bool fragmented;
|
|
|
|
|
2009-04-03 20:47:43 +07:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
/*
|
|
|
|
* when a cluster is allocated from a block group, we put the
|
|
|
|
* cluster onto a list in the block group so that it can
|
|
|
|
* be freed before the block group is freed.
|
|
|
|
*/
|
|
|
|
struct list_head block_group_list;
|
2008-03-25 02:01:59 +07:00
|
|
|
};
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 08:29:25 +07:00
|
|
|
enum btrfs_caching_type {
|
|
|
|
BTRFS_CACHE_NO = 0,
|
|
|
|
BTRFS_CACHE_STARTED = 1,
|
2011-11-15 01:52:14 +07:00
|
|
|
BTRFS_CACHE_FAST = 2,
|
|
|
|
BTRFS_CACHE_FINISHED = 3,
|
2013-08-05 22:15:21 +07:00
|
|
|
BTRFS_CACHE_ERROR = 4,
|
2009-07-14 08:29:25 +07:00
|
|
|
};
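/*
 * Illustration only, not part of the original header: a sketch of how the
 * caching states above can be tested for "no more caching work will
 * happen". Both FINISHED and ERROR are treated as terminal here, which is an
 * assumption of this sketch; example_cache_done() is a hypothetical helper
 * name.
 */

```c
#include <assert.h>
#include <stdbool.h>

enum btrfs_caching_type {
	BTRFS_CACHE_NO		= 0,
	BTRFS_CACHE_STARTED	= 1,
	BTRFS_CACHE_FAST	= 2,
	BTRFS_CACHE_FINISHED	= 3,
	BTRFS_CACHE_ERROR	= 4,
};

/* True once caching has ended, whether it succeeded or failed. */
static inline bool example_cache_done(enum btrfs_caching_type state)
{
	return state == BTRFS_CACHE_FINISHED || state == BTRFS_CACHE_ERROR;
}

/* Hypothetical self-check. */
static inline void example_cache_done_check(void)
{
	assert(!example_cache_done(BTRFS_CACHE_NO));
	assert(!example_cache_done(BTRFS_CACHE_STARTED));
	assert(!example_cache_done(BTRFS_CACHE_FAST));
	assert(example_cache_done(BTRFS_CACHE_FINISHED));
	assert(example_cache_done(BTRFS_CACHE_ERROR));
}
```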
|
|
|
|
|
2010-06-22 01:48:16 +07:00
|
|
|
enum btrfs_disk_cache_state {
|
|
|
|
BTRFS_DC_WRITTEN = 0,
|
|
|
|
BTRFS_DC_ERROR = 1,
|
|
|
|
BTRFS_DC_CLEAR = 2,
|
|
|
|
BTRFS_DC_SETUP = 3,
|
|
|
|
};
|
|
|
|
|
2009-09-12 03:11:19 +07:00
|
|
|
struct btrfs_caching_control {
|
|
|
|
struct list_head list;
|
|
|
|
struct mutex mutex;
|
|
|
|
wait_queue_head_t wait;
|
2011-07-01 01:42:28 +07:00
|
|
|
struct btrfs_work work;
|
2009-09-12 03:11:19 +07:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 progress;
|
|
|
|
atomic_t count;
|
|
|
|
};
|
|
|
|
|
2015-04-07 03:17:20 +07:00
|
|
|
struct btrfs_io_ctl {
|
|
|
|
void *cur, *orig;
|
|
|
|
struct page *page;
|
|
|
|
struct page **pages;
|
|
|
|
struct btrfs_root *root;
|
2015-04-05 07:14:42 +07:00
|
|
|
struct inode *inode;
|
2015-04-07 03:17:20 +07:00
|
|
|
unsigned long size;
|
|
|
|
int index;
|
|
|
|
int num_pages;
|
2015-04-05 07:14:42 +07:00
|
|
|
int entries;
|
|
|
|
int bitmaps;
|
2015-04-07 03:17:20 +07:00
|
|
|
unsigned check_crcs:1;
|
|
|
|
};
|
|
|
|
|
2007-04-27 03:46:15 +07:00
|
|
|
struct btrfs_block_group_cache {
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_block_group_item item;
|
2009-07-14 08:29:25 +07:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2010-06-22 01:48:16 +07:00
|
|
|
struct inode *inode;
|
2008-07-23 10:06:41 +07:00
|
|
|
spinlock_t lock;
|
2007-11-17 02:57:08 +07:00
|
|
|
u64 pinned;
|
2008-09-26 21:05:48 +07:00
|
|
|
u64 reserved;
|
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.
There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 09:42:50 +07:00
|
|
|
u64 delalloc_bytes;
|
2009-09-12 03:11:20 +07:00
|
|
|
u64 bytes_super;
|
2008-03-25 02:01:56 +07:00
|
|
|
u64 flags;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 08:29:25 +07:00
|
|
|
u64 sectorsize;
|
2011-10-06 19:58:24 +07:00
|
|
|
u64 cache_generation;
|
2013-01-30 06:40:14 +07:00
|
|
|
|
2014-06-19 09:42:50 +07:00
|
|
|
/*
|
|
|
|
* It is just used for the delayed data space allocation because
|
|
|
|
* only the data space allocation and the related metadata update
|
|
|
|
* can be done across transactions.
|
|
|
|
*/
|
|
|
|
struct rw_semaphore data_rwsem;
|
|
|
|
|
2013-01-30 06:40:14 +07:00
|
|
|
/* for raid56, this is a full stripe, without parity */
|
|
|
|
unsigned long full_stripe_len;
|
|
|
|
|
2015-08-05 15:43:27 +07:00
|
|
|
unsigned int ro;
|
2010-11-20 19:03:07 +07:00
|
|
|
unsigned int iref:1;
|
2014-11-26 22:28:51 +07:00
|
|
|
unsigned int has_caching_ctl:1;
|
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-28 04:14:15 +07:00
|
|
|
unsigned int removed:1;
|
2010-06-22 01:48:16 +07:00
|
|
|
|
|
|
|
int disk_cache_state;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) removed the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. There were some issues in volumes.c where we wouldn't allocate
the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and makes a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
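The two-step search from point 1 of the message above (first by offset from the hint, then by size) can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain array sorted by offset and linear scans instead of the two rb-trees, it only considers extents that start at or after the hint, and the names demo_free_extent and demo_find_free_space are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free space entry; the kernel keeps these in rb-trees. */
struct demo_free_extent {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * ext[] is assumed to be sorted by offset.  Returns the chosen extent,
 * or NULL when nothing is large enough.
 */
static const struct demo_free_extent *
demo_find_free_space(const struct demo_free_extent *ext, size_t n,
		     uint64_t hint, uint64_t bytes)
{
	const struct demo_free_extent *best = NULL;
	size_t i;

	assert(ext != NULL || n == 0);

	/* Pass 1 (offset order): first fit starting at or after the hint. */
	for (i = 0; i < n; i++)
		if (ext[i].offset >= hint && ext[i].bytes >= bytes)
			return &ext[i];

	/* Pass 2 (size order): smallest extent that is still big enough. */
	for (i = 0; i < n; i++)
		if (ext[i].bytes >= bytes &&
		    (best == NULL || ext[i].bytes < best->bytes))
			best = &ext[i];

	return best;
}
```

Picking the smallest sufficient extent in the fallback pass mirrors the stated goal of not breaking small chunks off of huge free areas.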
2008-09-24 00:14:11 +07:00
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads every time it finds 2 meg worth of
space, and then again when it's finished caching. This is how I tested the
speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
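Two rules from the message above lend themselves to a small sketch: a block group whose used bytes equal its size can be marked cached straight away, and waiters are woken each time roughly 2 MiB of free space has been found, plus once more at the end. The code below is a hedged illustration only. The names are hypothetical, and the exact comparison against the 2 MiB threshold (">=" here) is an assumption rather than something the message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WAKE_BYTES (2ULL * 1024 * 1024)  /* "2 meg" from the message */

enum demo_cache_state { DEMO_CACHE_NO, DEMO_CACHE_FINISHED };

/* A completely full block group has no free space left to discover. */
static enum demo_cache_state demo_initial_state(uint64_t used, uint64_t size)
{
	assert(used <= size);
	return used == size ? DEMO_CACHE_FINISHED : DEMO_CACHE_NO;
}

/*
 * found[] lists the free ranges in the order the caching thread sees them.
 * Returns how many times the waiters would be woken.
 */
static unsigned int demo_count_wakeups(const uint64_t *found, size_t n)
{
	uint64_t total = 0;
	unsigned int wakeups = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		total += found[i];
		if (total >= DEMO_WAKE_BYTES) {  /* assumed comparison */
			total = 0;
			wakeups++;
		}
	}
	return wakeups + 1;  /* final wakeup when caching finishes */
}
```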
2009-07-14 08:29:25 +07:00
	/* cache tracking stuff */
	int cached;
2009-09-12 03:11:19 +07:00
	struct btrfs_caching_control *caching_ctl;
	u64 last_byte_to_unpin;
2009-07-14 08:29:25 +07:00
2008-09-24 00:14:11 +07:00
	struct btrfs_space_info *space_info;

	/* free space cache stuff */
2011-03-29 12:46:06 +07:00
	struct btrfs_free_space_ctl *free_space_ctl;
2008-09-24 00:14:11 +07:00

	/* block group cache stuff */
	struct rb_node cache_node;

	/* for block groups in the same raid type */
	struct list_head list;
2008-12-12 04:30:39 +07:00

	/* usage count */
	atomic_t count;
2009-04-03 20:47:43 +07:00

	/* List of struct btrfs_free_clusters for this block group.
	 * Today it will only have one thing on it, but that may change
	 */
	struct list_head cluster_list;
2012-09-12 03:57:25 +07:00

2014-09-18 22:20:02 +07:00
	/* For delayed block group creation or deletion of empty block groups */
	struct list_head bg_list;
2014-10-31 20:49:34 +07:00

	/* For read-only block groups */
	struct list_head ro_list;
2014-11-28 04:14:15 +07:00

	atomic_t trimming;
2014-11-18 03:45:48 +07:00

	/* For dirty block groups */
	struct list_head dirty_list;
2015-04-05 07:14:42 +07:00
	struct list_head io_list;

	struct btrfs_io_ctl io_ctl;
2007-04-27 03:46:15 +07:00
};
2008-03-25 02:01:56 +07:00

2012-06-21 16:08:04 +07:00
/* delayed seq elem */
struct seq_list {
	struct list_head list;
	u64 seq;
};

2015-02-25 21:47:32 +07:00
#define SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }
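As an illustration of how SEQ_LIST_INIT can be used, the sketch below compiles the struct and the macro in userspace. The list_head, LIST_HEAD_INIT and u64 definitions are minimal stand-ins written for this example (assumed to match the usual kernel shape), and demo_list_empty and demo_elem are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for <linux/list.h> and the kernel's u64 type. */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }
typedef uint64_t u64;

/* Copied from the header above. */
struct seq_list {
	struct list_head list;
	u64 seq;
};
#define SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }

/* An initialized but unlinked element points back at itself. */
static int demo_list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Static initialization: no runtime init call is needed. */
static struct seq_list demo_elem = SEQ_LIST_INIT(demo_elem);
```

The macro takes the variable's own name because the embedded list head has to be initialized with its own address.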
2013-02-08 04:06:02 +07:00
enum btrfs_orphan_cleanup_state {
	ORPHAN_CLEANUP_STARTED = 1,
	ORPHAN_CLEANUP_DONE = 2,
};

2013-01-30 06:40:14 +07:00
/* used by the raid56 code to lock stripes for read/modify/write */
struct btrfs_stripe_hash {
	struct list_head hash_list;
	wait_queue_head_t wait;
	spinlock_t lock;
};

/* used by the raid56 code to lock stripes for read/modify/write */
struct btrfs_stripe_hash_table {
2013-02-01 02:42:09 +07:00
	struct list_head stripe_cache;
	spinlock_t cache_lock;
	int cache_size;
	struct btrfs_stripe_hash table[];
2013-01-30 06:40:14 +07:00
};

#define BTRFS_STRIPE_HASH_TABLE_BITS 11
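A table sized by an 11-bit constant would have 2048 buckets if the bucket count is 1 << BTRFS_STRIPE_HASH_TABLE_BITS, which is what the name suggests but is an assumption here. The sketch below shows one way a stripe's starting byte could be mapped to such a bucket. The multiplicative hash and the names DEMO_NUM_BUCKETS and demo_stripe_bucket are hypothetical stand-ins, not the helper the raid56 code uses.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_STRIPE_HASH_TABLE_BITS 11

/* Assumed bucket count: one bucket per possible 11-bit hash value. */
#define DEMO_NUM_BUCKETS (1U << BTRFS_STRIPE_HASH_TABLE_BITS)

/*
 * Hypothetical hash: multiply by a 64-bit odd constant and keep the
 * top BTRFS_STRIPE_HASH_TABLE_BITS bits, so the result is always a
 * valid bucket index.
 */
static unsigned int demo_stripe_bucket(uint64_t stripe_start)
{
	const uint64_t multiplier = 0x9E3779B97F4A7C15ULL;

	return (unsigned int)((stripe_start * multiplier) >>
			      (64 - BTRFS_STRIPE_HASH_TABLE_BITS));
}
```

Each bucket would then carry its own lock and wait queue, as struct btrfs_stripe_hash does, so stripes that hash to different buckets do not contend with each other.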
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us to reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result (tested with compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 07:29:04 +07:00
void btrfs_init_async_reclaim_work(struct work_struct *work);

2012-06-21 16:08:04 +07:00
/* fs_info */
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
struct reloc_control;
2008-03-25 02:01:56 +07:00
struct btrfs_device;
2008-03-25 02:02:07 +07:00
struct btrfs_fs_devices;
2012-01-17 03:04:47 +07:00
struct btrfs_balance_control;
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works for
some delayed nodes and insert them into the work queue of the worker, and then
go back.
When the number of delayed items is beyond the upper bound, we create works for
all the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait until the number of untreated items is below
some threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we find it, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode into the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
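The insert and delete rules for delayed directory index items in the message above can be sketched as follows. This is a simplified illustration: small fixed arrays stand in for the two rb-trees, the balancing against lower and upper limits is left out, and all names (demo_key_set, demo_delayed_node, demo_delayed_insert, demo_delayed_delete) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ITEMS 16

/* Stand-in for one rb-tree of pending directory index keys. */
struct demo_key_set {
	uint64_t keys[DEMO_MAX_ITEMS];
	size_t count;
};

/* One delayed node: pending insertions and pending deletions. */
struct demo_delayed_node {
	struct demo_key_set ins;
	struct demo_key_set del;
};

static void demo_set_add(struct demo_key_set *s, uint64_t key)
{
	assert(s->count < DEMO_MAX_ITEMS);
	s->keys[s->count++] = key;
}

/* Removes key if present; order of the remaining keys is not kept. */
static bool demo_set_remove(struct demo_key_set *s, uint64_t key)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->keys[i] == key) {
			s->keys[i] = s->keys[--s->count];
			return true;
		}
	}
	return false;
}

/* Insertion is only recorded; the b+ tree is touched later. */
static void demo_delayed_insert(struct demo_delayed_node *n, uint64_t index)
{
	demo_set_add(&n->ins, index);
}

/*
 * Deletion first cancels a pending insertion of the same index.
 * Only if there is none does it queue a real deletion.
 */
static void demo_delayed_delete(struct demo_delayed_node *n, uint64_t index)
{
	if (demo_set_remove(&n->ins, index))
		return;
	demo_set_add(&n->del, index);
}
```

An index that is created and removed before the worker runs therefore never reaches the b+ tree at all, which is one reason this kind of batching can help short-lived files.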
2011-04-22 17:12:22 +07:00
struct btrfs_delayed_root;
2007-03-21 01:38:32 +07:00
struct btrfs_fs_info {
2007-10-16 03:14:19 +07:00
	u8 fsid[BTRFS_FSID_SIZE];
2008-04-16 02:41:47 +07:00
	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
2007-03-15 23:56:47 +07:00
	struct btrfs_root *extent_root;
	struct btrfs_root *tree_root;
2008-03-25 02:01:56 +07:00
	struct btrfs_root *chunk_root;
	struct btrfs_root *dev_root;
2008-11-18 09:02:50 +07:00
	struct btrfs_root *fs_root;
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
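The keying scheme described in the last paragraph of the message above (a fixed objectid, with the offset set to the starting byte of the extent) can be illustrated with a small sketch. The struct layout, the names and the two constants below are placeholders chosen for this example, not values quoted from ctree.h.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified key with the usual (objectid, type, offset) ordering. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Placeholder values for this sketch only. */
#define DEMO_CSUM_OBJECTID ((uint64_t)-10)
#define DEMO_CSUM_KEY_TYPE 128

/* Every checksum item shares the objectid; only the offset varies. */
static struct demo_key demo_csum_key(uint64_t extent_start)
{
	struct demo_key key = {
		.objectid = DEMO_CSUM_OBJECTID,
		.type = DEMO_CSUM_KEY_TYPE,
		.offset = extent_start,
	};
	return key;
}

/* Compare by objectid, then type, then offset. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid and type are constant, checksum items sort purely by the extent's starting byte, so a lookup needs nothing beyond that byte number, and the same key format can be copied into another tree unchanged.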
2008-12-09 04:58:54 +07:00
	struct btrfs_root *csum_root;
2011-09-13 17:56:09 +07:00
	struct btrfs_root *quota_root;
2013-08-15 22:11:19 +07:00
	struct btrfs_root *uuid_root;
2008-09-06 03:13:11 +07:00

	/* the log root tree is a directory of all the other log roots */
	struct btrfs_root *log_root_tree;
2009-09-22 02:56:00 +07:00

	spinlock_t fs_roots_radix_lock;
2007-04-09 21:42:37 +07:00
	struct radix_tree_root fs_roots_radix;
2007-10-16 03:15:26 +07:00

2008-09-24 00:14:11 +07:00
	/* block group cache stuff */
	spinlock_t block_group_cache_lock;
2012-12-27 16:01:23 +07:00
	u64 first_logical_byte;
2008-09-24 00:14:11 +07:00
|
|
|
struct rb_root block_group_cache_tree;
|
|
|
|
|
2011-09-27 04:12:22 +07:00
|
|
|
/* keep track of unallocated space */
|
|
|
|
spinlock_t free_chunk_lock;
|
|
|
|
u64 free_chunk_space;
|
|
|
|
|
2009-09-12 03:11:19 +07:00
|
|
|
struct extent_io_tree freed_extents[2];
|
|
|
|
struct extent_io_tree *pinned_extents;
|
2007-10-16 03:15:26 +07:00
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
/* logical->physical extent mapping */
|
|
|
|
struct btrfs_mapping_tree mapping_tree;
|
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we delay some of these b+ tree insertions or deletions, we can improve
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes, which are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from
the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index item insertions
and deletions and the delayed inode update.
When the number of delayed items exceeds the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then return.
When the number of delayed items exceeds the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
drops below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we search
for it in the inserting rb-tree first. If we find it, we just drop it. If
not, we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- When we commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
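The balance policy described in the message (queue some work past a lower limit, queue everything and wait past an upper bound) can be sketched as a small decision function. The enum, the function name and the threshold parameters below are hypothetical; the message gives no concrete limits and this is not the code from the patch.

```c
#include <assert.h>

/* Hypothetical outcomes of the delayed-items balance check. */
enum delayed_action {
	DELAYED_NONE,			/* below the lower limit: nothing to do */
	DELAYED_ASYNC_SOME,		/* queue works for some nodes, then return */
	DELAYED_FLUSH_ALL_AND_WAIT,	/* queue works for all nodes, then wait */
};

/* Decide what to do given the current number of delayed items. */
enum delayed_action delayed_balance_action(int nr_items, int lower, int upper)
{
	assert(lower <= upper);

	if (nr_items > upper)
		return DELAYED_FLUSH_ALL_AND_WAIT;
	if (nr_items > lower)
		return DELAYED_ASYNC_SOME;
	return DELAYED_NONE;
}
```

Checking the upper bound first keeps the two thresholds from overlapping: a count above both limits always takes the blocking path.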
2011-04-22 17:12:22 +07:00
|
|
|
/*
|
|
|
|
* block reservation for extent, checksum, root tree and
|
|
|
|
* delayed dir index item
|
|
|
|
*/
|
2010-05-16 21:46:25 +07:00
|
|
|
struct btrfs_block_rsv global_block_rsv;
|
|
|
|
/* block reservation for delay allocation */
|
|
|
|
struct btrfs_block_rsv delalloc_block_rsv;
|
|
|
|
/* block reservation for metadata operations */
|
|
|
|
struct btrfs_block_rsv trans_block_rsv;
|
|
|
|
/* block reservation for chunk tree */
|
|
|
|
struct btrfs_block_rsv chunk_block_rsv;
|
2011-11-04 09:54:25 +07:00
|
|
|
/* block reservation for delayed operations */
|
|
|
|
struct btrfs_block_rsv delayed_block_rsv;
|
2010-05-16 21:46:25 +07:00
|
|
|
|
|
|
|
struct btrfs_block_rsv empty_block_rsv;
|
|
|
|
|
2007-03-21 02:57:25 +07:00
|
|
|
u64 generation;
|
2007-08-11 03:22:09 +07:00
|
|
|
u64 last_trans_committed;
|
2014-01-23 22:54:11 +07:00
|
|
|
u64 avg_delayed_ref_runtime;
|
2009-03-24 21:24:20 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is updated to the current trans every time a full commit
|
|
|
|
* is required instead of the faster short fsync log commits
|
|
|
|
*/
|
|
|
|
u64 last_trans_log_full_commit;
|
2012-03-30 18:58:32 +07:00
|
|
|
unsigned long mount_opt;
|
2014-02-05 21:26:17 +07:00
|
|
|
/*
|
|
|
|
* Track requests for actions that need to be done during transaction
|
|
|
|
* commit (like for some mount options).
|
|
|
|
*/
|
|
|
|
unsigned long pending_changes;
|
2010-12-17 13:21:50 +07:00
|
|
|
unsigned long compress_type:4;
|
2013-08-01 23:14:52 +07:00
|
|
|
int commit_interval;
|
2013-01-29 17:05:05 +07:00
|
|
|
/*
 * This is an advisory value. The read side is safe even if it gets a
 * wrong number, because in that case we just write the data out into a
 * regular extent. The write side (mount/remount) runs under the
 * ->s_umount lock, so it is also safe.
 */
|
2008-01-30 04:03:38 +07:00
|
|
|
u64 max_inline;
|
Btrfs: protect fs_info->alloc_start
fs_info->alloc_start is a 64-bit variable that can be accessed by
multiple tasks, but it is not strictly protected, so it can be changed
while we are accessing it. On a 32-bit machine we can read a wrong
value because the access takes two instructions. (In fact, the same
problem can also happen on a 64-bit machine, because the compiler may
split the 64-bit operation into two 32-bit operations.)
For example:
Assuming ->alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
	Task0				Task1
					load high 32 bits
	set high 32 bits
	set low 32 bits
					load low 32 bits
Task1 will get 0.
This patch fixes the problem by using two locks to protect it:
	fs_info->chunk_mutex
	sb->s_umount
On the read side, we only need to take one of these two locks; on
the write side, we must take both of them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
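The torn read in the example can be replayed deterministically in plain C. The sketch below is single-threaded and only simulates the interleaving from the message by splitting the value into two 32-bit halves; `struct split_u64`, `torn_read_demo` and `locked_read_demo` are hypothetical names, and no kernel locking primitives are used.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value stored as two 32-bit halves, as a 32-bit CPU would access it. */
struct split_u64 {
	uint32_t hi;
	uint32_t lo;
};

static struct split_u64 split_u64_from(uint64_t v)
{
	struct split_u64 s = { (uint32_t)(v >> 32), (uint32_t)v };
	return s;
}

static uint64_t split_u64_join(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Replay the unprotected interleaving; returns what the reading task observes. */
uint64_t torn_read_demo(uint64_t old_val, uint64_t new_val)
{
	struct split_u64 shared = split_u64_from(old_val);
	const struct split_u64 next = split_u64_from(new_val);
	uint32_t seen_hi, seen_lo;

	assert(old_val != new_val);	/* an unchanged value cannot tear */

	seen_hi = shared.hi;	/* reader (Task1): load high 32 bits */
	shared.hi = next.hi;	/* writer (Task0): set high 32 bits */
	shared.lo = next.lo;	/* writer (Task0): set low 32 bits */
	seen_lo = shared.lo;	/* reader (Task1): load low 32 bits */

	return split_u64_join(seen_hi, seen_lo);
}

/* Same accesses, but the reader finishes both loads before the writer runs,
 * which is the ordering a shared lock would enforce. */
uint64_t locked_read_demo(uint64_t old_val, uint64_t new_val)
{
	struct split_u64 shared = split_u64_from(old_val);
	const struct split_u64 next = split_u64_from(new_val);
	uint32_t seen_hi = shared.hi;
	uint32_t seen_lo = shared.lo;

	shared.hi = next.hi;
	shared.lo = next.lo;
	(void)shared;

	return split_u64_join(seen_hi, seen_lo);
}
```

With the values from the message, the reader combines the old high half (0) with the new low half (0) and sees 0, a value that was never stored.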
2013-01-29 17:07:33 +07:00
|
|
|
/*
 * Protected by ->chunk_mutex and sb->s_umount.
 *
 * Two locks are used because only mount and remount can change this
 * value, and both run under sb->s_umount, while the read side (chunk
 * allocation) cannot acquire sb->s_umount without risking a deadlock.
 * The write side must therefore hold both locks; the read side only
 * needs to hold one of them.
 */
|
2008-01-02 22:01:11 +07:00
|
|
|
u64 alloc_start;
|
2007-03-23 02:59:16 +07:00
|
|
|
struct btrfs_transaction *running_transaction;
|
2008-07-17 23:53:50 +07:00
|
|
|
wait_queue_head_t transaction_throttle;
|
2008-07-17 23:54:14 +07:00
|
|
|
wait_queue_head_t transaction_wait;
|
2010-10-30 02:37:34 +07:00
|
|
|
wait_queue_head_t transaction_blocked_wait;
|
2008-11-07 10:02:51 +07:00
|
|
|
wait_queue_head_t async_submit_wait;
|
2008-09-06 03:13:11 +07:00
|
|
|
|
2013-04-11 17:30:16 +07:00
|
|
|
/*
 * Protects incompat_flags, compat_flags and compat_ro_flags while
 * they are being updated.
 *
 * The flags are never cleared, so the read side does not need to
 * take the lock.
 *
 * The lock is also not needed while mounting the fs, because no
 * other task can update the flags at that point.
 */
|
|
|
|
spinlock_t super_lock;
|
2011-04-13 20:41:04 +07:00
|
|
|
struct btrfs_super_block *super_copy;
|
|
|
|
struct btrfs_super_block *super_for_commit;
|
2008-03-25 02:01:56 +07:00
|
|
|
struct block_device *__bdev;
|
2007-03-22 23:13:20 +07:00
|
|
|
struct super_block *sb;
|
2007-03-29 00:57:48 +07:00
|
|
|
struct inode *btree_inode;
|
2008-03-26 21:28:07 +07:00
|
|
|
struct backing_dev_info bdi;
|
2008-09-06 03:13:11 +07:00
|
|
|
struct mutex tree_log_mutex;
|
2008-06-26 03:01:31 +07:00
|
|
|
struct mutex transaction_kthread_mutex;
|
|
|
|
struct mutex cleaner_mutex;
|
2008-06-26 03:01:30 +07:00
|
|
|
struct mutex chunk_mutex;
|
2008-07-09 01:19:17 +07:00
|
|
|
struct mutex volume_mutex;
|
2013-01-30 06:40:14 +07:00
|
|
|
|
2015-04-07 02:46:08 +07:00
|
|
|
/*
|
|
|
|
* this is taken to make sure we don't set block groups ro after
|
|
|
|
* the free space cache has been allocated on them
|
|
|
|
*/
|
|
|
|
struct mutex ro_block_group_mutex;
|
|
|
|
|
2013-01-30 06:40:14 +07:00
|
|
|
/* this is used during read/modify/write to make sure
|
|
|
|
* no two ios are trying to mod the same stripe at the same
|
|
|
|
* time
|
|
|
|
*/
|
|
|
|
struct btrfs_stripe_hash_table *stripe_hash_table;
|
|
|
|
|
2009-04-01 00:27:11 +07:00
|
|
|
/*
|
|
|
|
* this protects the ordered operations list only while we are
|
|
|
|
* processing all of the entries on it. This way we make
|
|
|
|
* sure the commit code doesn't find the list temporarily empty
|
|
|
|
* because another function happens to be doing non-waiting preflush
|
|
|
|
* before jumping into the main commit.
|
|
|
|
*/
|
|
|
|
struct mutex ordered_operations_mutex;
|
2013-08-14 22:33:56 +07:00
|
|
|
|
2014-03-14 02:42:13 +07:00
|
|
|
struct rw_semaphore commit_root_sem;
|
2009-04-01 00:27:11 +07:00
|
|
|
|
2009-11-12 16:34:40 +07:00
|
|
|
struct rw_semaphore cleanup_work_sem;
|
2009-09-22 03:00:26 +07:00
|
|
|
|
2009-11-12 16:34:40 +07:00
|
|
|
struct rw_semaphore subvol_sem;
|
2009-09-22 03:00:26 +07:00
|
|
|
struct srcu_struct subvol_srcu;
|
|
|
|
|
2011-04-12 04:25:13 +07:00
|
|
|
spinlock_t trans_lock;
|
2011-06-14 07:00:16 +07:00
|
|
|
/*
|
|
|
|
* the reloc mutex goes with the trans lock, it is taken
|
|
|
|
* during commit to protect us from the relocation code
|
|
|
|
*/
|
|
|
|
struct mutex reloc_mutex;
|
|
|
|
|
2007-04-20 08:01:03 +07:00
|
|
|
struct list_head trans_list;
|
2007-06-09 05:11:48 +07:00
|
|
|
struct list_head dead_roots;
|
2009-09-12 03:11:19 +07:00
|
|
|
struct list_head caching_block_groups;
|
2008-09-06 03:13:11 +07:00
|
|
|
|
2009-11-12 16:36:34 +07:00
|
|
|
spinlock_t delayed_iput_lock;
|
|
|
|
struct list_head delayed_iputs;
|
2015-02-26 09:49:20 +07:00
|
|
|
struct rw_semaphore delayed_iput_sem;
|
2009-11-12 16:36:34 +07:00
|
|
|
|
2012-05-16 22:55:38 +07:00
|
|
|
/* this protects tree_mod_seq_list */
|
|
|
|
spinlock_t tree_mod_seq_lock;
|
2013-04-24 23:57:33 +07:00
|
|
|
atomic64_t tree_mod_seq;
|
2012-05-16 22:55:38 +07:00
|
|
|
struct list_head tree_mod_seq_list;
|
|
|
|
|
|
|
|
/* this protects tree_mod_log */
|
|
|
|
rwlock_t tree_mod_log_lock;
|
|
|
|
struct rb_root tree_mod_log;
|
|
|
|
|
2008-05-16 03:15:45 +07:00
|
|
|
atomic_t nr_async_submits;
|
2008-09-29 22:19:10 +07:00
|
|
|
atomic_t async_submit_draining;
|
2008-08-16 02:34:15 +07:00
|
|
|
atomic_t nr_async_bios;
|
2008-11-07 10:02:51 +07:00
|
|
|
atomic_t async_delalloc_pages;
|
2011-04-12 04:25:13 +07:00
|
|
|
atomic_t open_ioctl_trans;
|
2008-04-10 03:28:12 +07:00
|
|
|
|
2008-07-24 22:57:52 +07:00
|
|
|
/*
|
2013-05-15 14:48:23 +07:00
|
|
|
* this is used to protect the following list -- ordered_roots.
|
2008-07-24 22:57:52 +07:00
|
|
|
*/
|
2013-05-15 14:48:23 +07:00
|
|
|
spinlock_t ordered_root_lock;
|
2009-04-01 00:27:11 +07:00
|
|
|
|
|
|
|
/*
|
2013-05-15 14:48:23 +07:00
|
|
|
* all fs/file tree roots in which there are data=ordered extents
|
|
|
|
* pending writeback are added into this list.
|
|
|
|
*
|
2009-04-01 00:27:11 +07:00
|
|
|
* these can span multiple transactions and basically include
|
|
|
|
* every dirty data page that isn't from nodatacow
|
|
|
|
*/
|
2013-05-15 14:48:23 +07:00
|
|
|
struct list_head ordered_roots;
|
2009-04-01 00:27:11 +07:00
|
|
|
|
2014-03-06 12:55:03 +07:00
|
|
|
struct mutex delalloc_root_mutex;
|
2013-05-15 14:48:22 +07:00
|
|
|
spinlock_t delalloc_root_lock;
|
|
|
|
/* all fs/file tree roots that have delalloc inodes. */
|
|
|
|
struct list_head delalloc_roots;
|
2008-07-24 22:57:52 +07:00
|
|
|
|
2008-06-12 03:50:36 +07:00
|
|
|
/*
|
|
|
|
* there is a pool of worker threads for checksumming during writes
|
|
|
|
* and a pool for checksumming after reads. This is because readers
|
|
|
|
* can run with FS locks held, and the writers may be waiting for
|
|
|
|
* those locks. We don't want ordering in the pending list to cause
|
|
|
|
* deadlocks, and so the two are serviced separately.
|
2008-06-13 01:46:17 +07:00
|
|
|
*
|
|
|
|
* A third pool does submit_bio to avoid deadlocking with the other
|
|
|
|
* two
|
2008-06-12 03:50:36 +07:00
|
|
|
*/
|
2014-02-28 09:46:19 +07:00
|
|
|
struct btrfs_workqueue *workers;
|
|
|
|
struct btrfs_workqueue *delalloc_workers;
|
|
|
|
struct btrfs_workqueue *flush_workers;
|
|
|
|
struct btrfs_workqueue *endio_workers;
|
|
|
|
struct btrfs_workqueue *endio_meta_workers;
|
|
|
|
struct btrfs_workqueue *endio_raid56_workers;
|
2014-09-12 17:44:03 +07:00
|
|
|
struct btrfs_workqueue *endio_repair_workers;
|
2014-02-28 09:46:19 +07:00
|
|
|
struct btrfs_workqueue *rmw_workers;
|
|
|
|
struct btrfs_workqueue *endio_meta_write_workers;
|
|
|
|
struct btrfs_workqueue *endio_write_workers;
|
|
|
|
struct btrfs_workqueue *endio_freespace_worker;
|
|
|
|
struct btrfs_workqueue *submit_workers;
|
|
|
|
struct btrfs_workqueue *caching_workers;
|
|
|
|
struct btrfs_workqueue *readahead_workers;
|
2011-07-01 01:42:28 +07:00
|
|
|
|
2008-07-17 23:53:51 +07:00
|
|
|
/*
|
|
|
|
* fixup workers take dirty pages that didn't properly go through
|
|
|
|
* the cow mechanism and make them safe to write. It happens
|
|
|
|
* for the sys_munmap function call path
|
|
|
|
*/
|
2014-02-28 09:46:19 +07:00
|
|
|
struct btrfs_workqueue *fixup_workers;
|
|
|
|
struct btrfs_workqueue *delayed_workers;
|
2014-05-23 06:18:52 +07:00
|
|
|
|
|
|
|
/* the extent workers do delayed refs on the extent allocation tree */
|
|
|
|
struct btrfs_workqueue *extent_workers;
|
2008-06-26 03:01:31 +07:00
|
|
|
struct task_struct *transaction_kthread;
|
|
|
|
struct task_struct *cleaner_kthread;
|
2008-06-12 08:47:56 +07:00
|
|
|
int thread_pool_size;
|
2008-06-12 03:50:36 +07:00
|
|
|
|
2013-11-02 00:07:04 +07:00
|
|
|
struct kobject *space_info_kobj;
|
2007-04-21 00:16:02 +07:00
|
|
|
int do_barriers;
|
2007-06-09 05:11:48 +07:00
|
|
|
int closing;
|
2008-09-06 03:13:11 +07:00
|
|
|
int log_root_recovering;
|
2014-09-18 22:20:02 +07:00
|
|
|
int open;
|
2007-03-21 01:38:32 +07:00
|
|
|
|
2007-11-17 02:57:08 +07:00
|
|
|
u64 total_pinned;
|
2009-03-13 22:00:37 +07:00
|
|
|
|
2013-01-29 17:09:20 +07:00
|
|
|
/* used to keep from writing metadata until there is a nice batch */
|
|
|
|
struct percpu_counter dirty_metadata_bytes;
|
2013-01-29 17:10:51 +07:00
|
|
|
struct percpu_counter delalloc_bytes;
|
2013-01-29 17:09:20 +07:00
|
|
|
s32 dirty_metadata_batch;
|
2013-01-29 17:10:51 +07:00
|
|
|
s32 delalloc_batch;
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
struct list_head dirty_cowonly_roots;
|
|
|
|
|
2008-03-25 02:02:07 +07:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
2009-03-10 23:39:20 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the space_info list is almost entirely read only. It only changes
|
|
|
|
* when we add a new raid type to the FS, and that happens
|
|
|
|
* very rarely. RCU is used to protect it.
|
|
|
|
*/
|
2008-03-25 02:01:59 +07:00
|
|
|
struct list_head space_info;
|
2009-03-10 23:39:20 +07:00
|
|
|
|
2012-07-10 09:21:07 +07:00
|
|
|
struct btrfs_space_info *data_sinfo;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
struct reloc_control *reloc_ctl;
|
|
|
|
|
2009-04-03 20:47:43 +07:00
|
|
|
/* data_alloc_cluster is only used in ssd mode */
|
|
|
|
struct btrfs_free_cluster data_alloc_cluster;
|
|
|
|
|
|
|
|
/* all metadata allocations go through this cluster */
|
|
|
|
struct btrfs_free_cluster meta_alloc_cluster;
|
2008-04-05 02:40:00 +07:00
|
|
|
|
2011-05-25 02:35:30 +07:00
|
|
|
/* auto defrag inodes go here */
|
|
|
|
spinlock_t defrag_inodes_lock;
|
|
|
|
struct rb_root defrag_inodes;
|
|
|
|
atomic_t defrag_running;
|
|
|
|
|
2013-01-29 17:13:12 +07:00
|
|
|
/* Used to protect avail_{data, metadata, system}_alloc_bits */
|
|
|
|
seqlock_t profiles_lock;
|
2012-01-17 03:04:47 +07:00
|
|
|
/*
|
|
|
|
* these three are in extended format (availability of single
|
|
|
|
* chunks is denoted by BTRFS_AVAIL_ALLOC_BIT_SINGLE bit, other
|
|
|
|
* types are denoted by corresponding BTRFS_BLOCK_GROUP_* bits)
|
|
|
|
*/
|
2008-04-05 02:40:00 +07:00
|
|
|
u64 avail_data_alloc_bits;
|
|
|
|
u64 avail_metadata_alloc_bits;
|
|
|
|
u64 avail_system_alloc_bits;
|
2008-04-29 02:29:42 +07:00
|
|
|
|
2012-01-17 03:04:47 +07:00
|
|
|
/* restriper state */
|
|
|
|
spinlock_t balance_lock;
|
|
|
|
struct mutex balance_mutex;
|
2012-01-17 03:04:49 +07:00
|
|
|
atomic_t balance_running;
|
|
|
|
atomic_t balance_pause_req;
|
2012-01-17 03:04:49 +07:00
|
|
|
atomic_t balance_cancel_req;
|
2012-01-17 03:04:47 +07:00
|
|
|
struct btrfs_balance_control *balance_ctl;
|
2012-01-17 03:04:49 +07:00
|
|
|
wait_queue_head_t balance_wait_q;
|
2012-01-17 03:04:47 +07:00
|
|
|
|
2009-04-22 04:40:57 +07:00
|
|
|
unsigned data_chunk_allocations;
|
|
|
|
unsigned metadata_ratio;
|
|
|
|
|
2008-04-29 02:29:42 +07:00
|
|
|
void *bdev_holder;
|
2011-01-06 18:30:25 +07:00
|
|
|
|
2011-03-08 20:14:00 +07:00
|
|
|
/* private scrub information */
|
|
|
|
struct mutex scrub_lock;
|
|
|
|
atomic_t scrubs_running;
|
|
|
|
atomic_t scrub_pause_req;
|
|
|
|
atomic_t scrubs_paused;
|
|
|
|
atomic_t scrub_cancel_req;
|
|
|
|
wait_queue_head_t scrub_pause_wait;
|
|
|
|
int scrub_workers_refcnt;
|
2014-02-28 09:46:19 +07:00
|
|
|
struct btrfs_workqueue *scrub_workers;
|
|
|
|
struct btrfs_workqueue *scrub_wr_completion_workers;
|
|
|
|
struct btrfs_workqueue *scrub_nocow_workers;
|
2015-06-04 19:09:15 +07:00
|
|
|
struct btrfs_workqueue *scrub_parity_workers;
|
2011-03-08 20:14:00 +07:00
|
|
|
|
2011-11-09 19:44:05 +07:00
|
|
|
#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
|
|
|
|
u32 check_integrity_print_mask;
|
|
|
|
#endif
|
2011-09-13 17:56:09 +07:00
|
|
|
/*
|
|
|
|
* quota information
|
|
|
|
*/
|
|
|
|
unsigned int quota_enabled:1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* quota_enabled only changes state after a commit. This holds the
|
|
|
|
* next state.
|
|
|
|
*/
|
|
|
|
unsigned int pending_quota_state:1;
|
|
|
|
|
|
|
|
/* is qgroup tracking in a consistent state? */
|
|
|
|
u64 qgroup_flags;
|
|
|
|
|
|
|
|
/* holds configuration and tracking. Protected by qgroup_lock */
|
|
|
|
struct rb_root qgroup_tree;
|
2014-05-14 07:30:47 +07:00
|
|
|
struct rb_root qgroup_op_tree;
|
2011-09-13 17:56:09 +07:00
|
|
|
spinlock_t qgroup_lock;
|
2014-05-14 07:30:47 +07:00
|
|
|
spinlock_t qgroup_op_lock;
|
|
|
|
atomic_t qgroup_op_seq;
|
2011-09-13 17:56:09 +07:00
|
|
|
|
2013-05-06 18:03:27 +07:00
|
|
|
/*
|
|
|
|
* used to avoid frequently calling ulist_alloc()/ulist_free()
|
|
|
|
* when doing qgroup accounting, it must be protected by qgroup_lock.
|
|
|
|
*/
|
|
|
|
struct ulist *qgroup_ulist;
|
|
|
|
|
2013-04-07 17:50:16 +07:00
|
|
|
/* protect user change for quota operations */
|
|
|
|
struct mutex qgroup_ioctl_lock;
|
|
|
|
|
2011-09-13 17:56:09 +07:00
|
|
|
/* list of dirty qgroups to be written at next commit */
|
|
|
|
struct list_head dirty_qgroups;
|
|
|
|
|
2015-04-17 09:23:16 +07:00
|
|
|
/* used by qgroup for an efficient tree traversal */
|
2011-09-13 17:56:09 +07:00
|
|
|
u64 qgroup_seq;
|
2011-11-09 19:44:05 +07:00
|
|
|
|
2013-04-25 23:04:51 +07:00
|
|
|
/* qgroup rescan items */
|
|
|
|
struct mutex qgroup_rescan_lock; /* protects the progress item */
|
|
|
|
struct btrfs_key qgroup_rescan_progress;
|
2014-02-28 09:46:19 +07:00
|
|
|
struct btrfs_workqueue *qgroup_rescan_workers;
|
2013-05-07 02:14:17 +07:00
|
|
|
struct completion qgroup_rescan_completion;
|
Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restructures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronization problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
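The scenario/step matrix from the message can be written down as a small lookup. This is only an illustration of the table: the enum names and both functions are hypothetical and do not appear in the btrfs sources.

```c
#include <assert.h>
#include <stdbool.h>

/* The three rescan startup scenarios (A), (B), (C) from the message. */
enum rescan_scenario {
	RESCAN_IOCTL,		/* (A) rescan ioctl */
	RESCAN_RESUME_ON_MOUNT,	/* (B) rescan resume by mount */
	RESCAN_QUOTA_ENABLE,	/* (C) rescan by quota enable */
};

/* The four steps (1)-(4), as bit flags. */
enum rescan_step {
	STEP_SET_PROGRESS = 1 << 0,
	STEP_COMMIT_TRANS = 1 << 1,
	STEP_SET_COUNTERS = 1 << 2,
	STEP_START_WORKER = 1 << 3,
};

/* Which steps a scenario needs; only the ioctl path commits the transaction. */
unsigned int rescan_steps(enum rescan_scenario s)
{
	unsigned int steps = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
			     STEP_START_WORKER;

	assert(s == RESCAN_IOCTL || s == RESCAN_RESUME_ON_MOUNT ||
	       s == RESCAN_QUOTA_ENABLE);

	if (s == RESCAN_IOCTL)
		steps |= STEP_COMMIT_TRANS;
	return steps;
}

/* (A) and (C) start progress and counters at zero; (B) restores the
 * state saved at umount. */
bool rescan_starts_from_zero(enum rescan_scenario s)
{
	return s != RESCAN_RESUME_ON_MOUNT;
}
```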
2013-05-28 22:47:24 +07:00
|
|
|
struct btrfs_work qgroup_rescan_work;
|
2013-04-25 23:04:51 +07:00
|
|
|
|
2011-01-06 18:30:25 +07:00
|
|
|
/* filesystem state */
|
2013-01-29 17:14:48 +07:00
|
|
|
unsigned long fs_state;
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some of the b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list holds all the delayed nodes that have delayed items, and the
other holds the delayed nodes that are waiting to be dealt with
by the worker thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations, that is, the
delayed insertion and deletion of directory name index items
and the delayed inode update.
When the number of delayed items exceeds the lower limit, we create works for some
delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items falls below some
threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
Then we check the number of delayed items and balance them.
(The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first search for it
in the inserting rb-tree. If we find it, we just drop it. If not, we
add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and balance them.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- When we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
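The two-threshold balance policy in the list above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `delayed_balance_action`, `DELAYED_LOWER_LIMIT` and `DELAYED_UPPER_BOUND`, and the threshold values themselves, are hypothetical and are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical thresholds; the commit message does not state the real values. */
enum { DELAYED_LOWER_LIMIT = 16, DELAYED_UPPER_BOUND = 64 };

enum balance_action {
	BALANCE_NONE,       /* few delayed items: keep caching them */
	BALANCE_ASYNC_SOME, /* queue works for some delayed nodes and return */
	BALANCE_FLUSH_WAIT  /* queue works for all delayed nodes, then wait */
};

/* Decide what the caller should do, given the current number of delayed items. */
enum balance_action delayed_balance_action(int nr_items)
{
	if (nr_items > DELAYED_UPPER_BOUND)
		return BALANCE_FLUSH_WAIT;
	if (nr_items > DELAYED_LOWER_LIMIT)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

Checking the upper bound first keeps the function a simple cascade, so a count above both limits always takes the stricter action.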
I did a quick test with the benchmark tool[1] and found that we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 17:12:22 +07:00
|
|
|
|
|
|
|
struct btrfs_delayed_root *delayed_root;
|
2011-11-04 02:17:42 +07:00
|
|
|
|
2011-05-23 19:30:00 +07:00
|
|
|
/* readahead tree */
|
|
|
|
spinlock_t reada_lock;
|
|
|
|
struct radix_tree_root reada_tree;
|
2011-11-06 15:05:08 +07:00
|
|
|
|
2013-12-17 01:24:27 +07:00
|
|
|
/* Extent buffer radix tree */
|
|
|
|
spinlock_t buffer_lock;
|
|
|
|
struct radix_tree_root buffer_radix;
|
|
|
|
|
2011-11-04 02:17:42 +07:00
|
|
|
/* next backup root to be overwritten */
|
|
|
|
int backup_root_index;
|
2012-08-01 23:56:49 +07:00
|
|
|
|
|
|
|
int num_tolerated_disk_barrier_failures;
|
2012-11-05 23:26:40 +07:00
|
|
|
|
|
|
|
/* device replace state */
|
|
|
|
struct btrfs_dev_replace dev_replace;
|
2012-11-05 23:54:08 +07:00
|
|
|
|
|
|
|
atomic_t mutually_exclusive_operation_running;
|
2013-08-15 22:11:21 +07:00
|
|
|
|
Btrfs: fix use-after-free in the finishing procedure of the device replace
During a device replace test, we hit a null pointer dereference (it was very easy
to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). Two bugs caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree, and we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that includes the source device
before the mapping information is updated, and then submit a bio
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so this prevents
new chunk allocation in between, and after we remove the source device, all
the new chunks will be allocated on the new device. So it fixes
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter. We not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec it when ending bios.
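The counting idea behind the second fix can be sketched in user space as follows. This is not the kernel code, which uses a `percpu_counter` and a wait queue (see `bio_counter` and `replace_wait` in the struct). The type `replace_gate` and all function names here are hypothetical, and C11 atomics stand in for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the global @bio_counter. */
struct replace_gate {
	atomic_long in_flight; /* bios submitted but not yet ended */
	atomic_bool replacing; /* set while the source device is being removed */
};

void gate_init(struct replace_gate *g)
{
	atomic_init(&g->in_flight, 0);
	atomic_init(&g->replacing, false);
}

/* Call before submitting a bio; returns false if the caller must wait. */
bool bio_enter(struct replace_gate *g)
{
	atomic_fetch_add(&g->in_flight, 1);
	if (atomic_load(&g->replacing)) {
		/* Removal in progress: back out so the counter can reach zero. */
		atomic_fetch_sub(&g->in_flight, 1);
		return false;
	}
	return true;
}

/* Call when a bio ends. */
void bio_exit(struct replace_gate *g)
{
	atomic_fetch_sub(&g->in_flight, 1);
}

/* Remover side, step 1: stop new bios from being produced. */
void gate_block_new_bios(struct replace_gate *g)
{
	atomic_store(&g->replacing, true);
}

/* Remover side, step 2: the source device may be freed only when this is true. */
bool gate_is_idle(struct replace_gate *g)
{
	return atomic_load(&g->in_flight) == 0;
}
```

Incrementing before checking the flag means a remover that sees a zero count after setting the flag cannot miss a bio that already passed the check.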
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in this patch, and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 15:46:55 +07:00
|
|
|
struct percpu_counter bio_counter;
|
|
|
|
wait_queue_head_t replace_wait;
|
|
|
|
|
2013-08-15 22:11:21 +07:00
|
|
|
struct semaphore uuid_tree_rescan_sem;
|
2013-08-15 22:11:23 +07:00
|
|
|
unsigned int update_uuid_tree_gen:1;
|
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim the metadata space
by itself if the metadata space was not enough. And when the task started
the space reclamation, all the other tasks that wanted to reserve
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.
So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and the
worker of the workqueue helps us reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 07:29:04 +07:00
|
|
|
|
|
|
|
/* Used to reclaim the metadata space in the background. */
|
|
|
|
struct work_struct async_reclaim_work;
|
2014-09-18 22:20:02 +07:00
|
|
|
|
|
|
|
spinlock_t unused_bgs_lock;
|
|
|
|
struct list_head unused_bgs;
|
2015-01-30 02:18:25 +07:00
|
|
|
struct mutex unused_bg_unpin_mutex;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
        CPU 0                                      CPU 1
btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                            cleaner_kthread
                                              locks fs_info->cleaner_mutex
                                              btrfs_delete_unused_bgs()
                                                finds bg X, which became
                                                unused in the previous
                                                transaction
                                                checks bg X ->ro == 0,
                                                so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                                btrfs_remove_chunk(bg X)
                                              unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
    btrfs_relocate_block_group(bg X) returns
    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 06:58:53 +07:00
|
|
|
struct mutex delete_unused_bgs_mutex;
|
2014-09-23 12:40:08 +07:00
|
|
|
|
|
|
|
/* For btrfs to record security options */
|
|
|
|
struct security_mnt_opts security_opts;
|
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (it doesn't start
or join an existing transaction), consists of visiting all block groups
and then, for each one, iterating its free space entries and performing a
discard operation against the space range represented by the free space
entries. However, before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete and removing the extent map that maps
logical addresses to physical addresses, along with the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-28 04:14:15 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Chunks that can't be freed yet (under a trim/discard operation)
|
|
|
|
* and will be later freed. Protected by fs_info->chunk_mutex.
|
|
|
|
*/
|
|
|
|
struct list_head pinned_chunks;
|
2007-11-17 02:57:08 +07:00
|
|
|
};
|
2008-03-25 02:01:56 +07:00
|
|
|
|
2014-03-06 12:38:19 +07:00
|
|
|
struct btrfs_subvolume_writers {
|
|
|
|
struct percpu_counter counter;
|
|
|
|
wait_queue_head_t wait;
|
|
|
|
};
|
|
|
|
|
2014-04-02 18:51:05 +07:00
|
|
|
/*
|
|
|
|
* The state of btrfs root
|
|
|
|
*/
|
|
|
|
/*
|
|
|
|
* btrfs_record_root_in_trans is a multi-step process,
|
|
|
|
* and it can race with the balancing code. But the
|
|
|
|
* race is very small, and only the first time the root
|
|
|
|
* is added to each transaction. So IN_TRANS_SETUP
|
|
|
|
* is used to tell us when more checks are required
|
|
|
|
*/
|
|
|
|
#define BTRFS_ROOT_IN_TRANS_SETUP 0
|
|
|
|
#define BTRFS_ROOT_REF_COWS 1
|
|
|
|
#define BTRFS_ROOT_TRACK_DIRTY 2
|
|
|
|
#define BTRFS_ROOT_IN_RADIX 3
|
|
|
|
#define BTRFS_ROOT_DUMMY_ROOT 4
|
|
|
|
#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED 5
|
|
|
|
#define BTRFS_ROOT_DEFRAG_RUNNING 6
|
|
|
|
#define BTRFS_ROOT_FORCE_COW 7
|
|
|
|
#define BTRFS_ROOT_MULTI_LOG_TASKS 8
|
2014-12-16 23:54:43 +07:00
|
|
|
#define BTRFS_ROOT_DIRTY 9
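The values above are bit numbers rather than masks: each one selects a single bit in the `unsigned long state` field of `struct btrfs_root`. The sketch below shows that usage in plain user-space C. The helpers `root_set_bit`, `root_clear_bit` and `root_test_bit` are hypothetical, non-atomic stand-ins for the kernel's `set_bit()`, `clear_bit()` and `test_bit()`, and only three of the definitions are repeated so the block is self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers repeated from the definitions above. */
#define BTRFS_ROOT_IN_TRANS_SETUP 0
#define BTRFS_ROOT_REF_COWS 1
#define BTRFS_ROOT_DIRTY 9

/* Simplified, non-atomic stand-ins for the kernel bit operations. */
void root_set_bit(int nr, unsigned long *state)
{
	*state |= 1UL << nr;
}

void root_clear_bit(int nr, unsigned long *state)
{
	*state &= ~(1UL << nr);
}

bool root_test_bit(int nr, const unsigned long *state)
{
	return (*state >> nr) & 1UL;
}
```

Packing the flags into one word lets independent properties of a root be set and queried without adding a separate field for each.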
|
2014-04-02 18:51:05 +07:00
|
|
|
|
2007-03-21 01:38:32 +07:00
|
|
|
/*
|
|
|
|
* in ram representation of the tree. extent_root is used for all allocations
|
2007-04-26 02:52:25 +07:00
|
|
|
* and for the extent tree extent_root root.
|
2007-03-21 01:38:32 +07:00
|
|
|
*/
|
|
|
|
struct btrfs_root {
|
2007-10-16 03:14:19 +07:00
|
|
|
struct extent_buffer *node;
|
2008-06-26 03:01:30 +07:00
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
struct extent_buffer *commit_root;
|
2008-09-06 03:13:11 +07:00
|
|
|
struct btrfs_root *log_root;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps:
COW the block through the subvol's reloc tree, then update the block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 21:09:34 +07:00
|
|
|
struct btrfs_root *reloc_root;
|
2008-07-29 02:32:19 +07:00
|
|
|
|
2014-04-02 18:51:05 +07:00
|
|
|
unsigned long state;
|
2007-03-15 23:56:47 +07:00
|
|
|
struct btrfs_root_item root_item;
|
|
|
|
struct btrfs_key root_key;
|
2007-03-21 01:38:32 +07:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2008-09-12 03:17:57 +07:00
|
|
|
struct extent_io_tree dirty_log_pages;
|
|
|
|
|
2008-06-26 03:01:30 +07:00
|
|
|
struct mutex objectid_mutex;
|
2009-01-22 00:54:03 +07:00
|
|
|
|
2010-05-16 21:46:25 +07:00
|
|
|
spinlock_t accounting_lock;
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
|
|
|
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, and we'll run out of inode numbers
as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
for extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 09:06:11 +07:00
|
|
|
/* free ino cache stuff */
|
|
|
|
struct btrfs_free_space_ctl *free_ino_ctl;
|
2014-02-05 08:37:48 +07:00
|
|
|
enum btrfs_caching_type ino_cache_state;
|
|
|
|
spinlock_t ino_cache_lock;
|
|
|
|
wait_queue_head_t ino_cache_wait;
|
2011-04-20 09:06:11 +07:00
|
|
|
struct btrfs_free_space_ctl *free_ino_pinned;
|
2014-02-05 08:37:48 +07:00
|
|
|
u64 ino_cache_progress;
|
|
|
|
struct inode *ino_cache_inode;
|
2011-04-20 09:06:11 +07:00
|
|
|
|
2008-09-06 03:13:11 +07:00
|
|
|
struct mutex log_mutex;
|
2009-01-22 00:54:03 +07:00
|
|
|
wait_queue_head_t log_writer_wait;
|
|
|
|
wait_queue_head_t log_commit_wait[2];
|
2014-02-20 17:08:58 +07:00
|
|
|
struct list_head log_ctxs[2];
|
2009-01-22 00:54:03 +07:00
|
|
|
atomic_t log_writers;
|
|
|
|
atomic_t log_commit[2];
|
2012-09-06 17:04:27 +07:00
|
|
|
atomic_t log_batch;
|
2014-02-20 17:08:56 +07:00
|
|
|
int log_transid;
|
2014-02-20 17:08:59 +07:00
|
|
|
/* Updated whether or not the commit succeeds. */
|
|
|
|
int log_transid_committed;
|
|
|
|
/* Only updated when the commit succeeds. */
|
2014-02-20 17:08:56 +07:00
|
|
|
int last_log_commit;
|
2009-10-09 02:30:04 +07:00
|
|
|
pid_t log_start_pid;
|
2008-08-05 10:17:27 +07:00
|
|
|
|
2007-04-09 21:42:37 +07:00
|
|
|
u64 objectid;
|
|
|
|
u64 last_trans;
|
2007-10-16 03:14:19 +07:00
|
|
|
|
|
|
|
/* data allocations are done in sectorsize units */
|
|
|
|
u32 sectorsize;
|
|
|
|
|
|
|
|
/* node allocations are done in nodesize units */
|
|
|
|
u32 nodesize;
|
|
|
|
|
2007-11-30 23:30:34 +07:00
|
|
|
u32 stripesize;
|
|
|
|
|
2007-03-21 01:38:32 +07:00
|
|
|
u32 type;
|
2009-09-22 02:56:00 +07:00
|
|
|
|
|
|
|
u64 highest_objectid;
|
2011-06-14 07:00:16 +07:00
|
|
|
|
2014-10-08 03:24:20 +07:00
|
|
|
/* only used when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is enabled */
|
2014-05-08 04:06:09 +07:00
|
|
|
u64 alloc_bytenr;
|
|
|
|
|
2008-06-26 03:01:31 +07:00
|
|
|
u64 defrag_trans_start;
|
2007-08-08 03:15:09 +07:00
|
|
|
struct btrfs_key defrag_progress;
|
2008-05-25 01:04:53 +07:00
|
|
|
struct btrfs_key defrag_max;
|
2007-08-30 02:47:34 +07:00
|
|
|
char *name;
|
2008-03-25 02:01:56 +07:00
|
|
|
|
|
|
|
/* the dirty list is only used by non-reference counted roots */
|
|
|
|
struct list_head dirty_list;
|
2008-07-24 23:17:14 +07:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
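The two COW rules in the paragraph above can be modelled with a toy structure. This is only a sketch: `toy_block`, `toy_cow` and `MAX_PTRS` are hypothetical names, and a real tree block carries far more state than a counter and a few pointers.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PTRS 4

/* Toy model of a tree block: its own reference count plus the extents it points to. */
struct toy_block {
	int refs;                   /* how many references this block has */
	int nr_ptrs;                /* number of valid entries in extent_refs */
	int *extent_refs[MAX_PTRS]; /* reference counts of the extents it points to */
	bool freed;
};

/* COW `old` into `copy`, applying the non-shared and shared rules. */
void toy_cow(struct toy_block *old, struct toy_block *copy)
{
	int i;

	copy->refs = 1;
	copy->freed = false;
	copy->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		copy->extent_refs[i] = old->extent_refs[i];

	if (old->refs == 1) {
		/* Non-shared: free the old block at once; the copy inherits its references. */
		old->refs = 0;
		old->freed = true;
	} else {
		/* Shared: every extent the copy points to gains one reference. */
		for (i = 0; i < copy->nr_ptrs; i++)
			(*copy->extent_refs[i])++;
		old->refs--;
	}
}
```

In the non-shared case no lower-level counter is touched, which is the saving the commit message describes.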
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
struct list_head root_list;
|
|
|
|
|
2012-10-13 02:27:49 +07:00
|
|
|
spinlock_t log_extents_lock[2];
|
|
|
|
struct list_head logged_list[2];
|
|
|
|
|
2010-05-16 21:49:58 +07:00
|
|
|
spinlock_t orphan_lock;
|
2012-05-24 01:26:42 +07:00
|
|
|
atomic_t orphan_inodes;
|
2010-05-16 21:49:58 +07:00
|
|
|
struct btrfs_block_rsv *orphan_block_rsv;
|
|
|
|
int orphan_cleanup_state;
|
2008-11-18 08:42:26 +07:00
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
spinlock_t inode_lock;
|
|
|
|
/* red-black tree that keeps track of in-memory inodes */
|
|
|
|
struct rb_root inode_tree;
|
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes, which are created for every file/directory.
One list holds all the delayed nodes that have delayed items. The other
holds the delayed nodes that are waiting to be dealt with by the work
thread.
- Every delayed node has two rb-trees: one manages the directory name index
items that are going to be inserted into the b+ tree, and the other manages
the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations: the insertion and
deletion of delayed directory name index items and the delayed inode update.
When the number of delayed items exceeds the lower limit, we create works for
some delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for
all the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items balance.
(The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first
search for it in the inserting rb-tree. If we find it, we just drop it. If
not, we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of delayed
items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the inode data
in the delayed node. The worker will flush it into the b+ tree after dealing
with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access it. This
way, we can cache more delayed items and merge more inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
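The two-threshold balance policy described in the list above can be sketched as a small decision function. This is only an illustration: the threshold values and all names below are hypothetical, since the commit message does not give concrete numbers or identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thresholds: the commit message gives no concrete values. */
#define DELAYED_LOWER_LIMIT 100
#define DELAYED_UPPER_LIMIT 400

static_assert(DELAYED_LOWER_LIMIT < DELAYED_UPPER_LIMIT,
	      "lower limit must be below the upper bound");

enum balance_action {
	BALANCE_NONE,       /* at or below the lower limit: keep batching */
	BALANCE_ASYNC_SOME, /* queue work for some nodes, then return */
	BALANCE_SYNC_ALL,   /* queue work for all pending nodes, then wait */
};

/* Decide what the caller should do for a given number of delayed items. */
static enum balance_action delayed_balance_action(size_t nr_items)
{
	if (nr_items > DELAYED_UPPER_LIMIT)
		return BALANCE_SYNC_ALL;
	if (nr_items > DELAYED_LOWER_LIMIT)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

The point of the two levels is that light pressure only costs the caller the time to queue work, while heavy pressure throttles the caller until the worker has caught up.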
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
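The quoted percentages can be checked against the total times above. The helper below is a hypothetical name used only for this check; it computes the relative drop in total time.

```c
#include <assert.h>

/* Percentage by which the total time dropped, relative to the "before" run. */
static double time_reduction_percent(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the numbers above, time_reduction_percent(1.096108, 0.932899) is roughly 14.9 for file creation and time_reduction_percent(1.510403, 1.215732) is roughly 19.5 for file deletion, which matches the ~15% and ~20% figures.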
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 17:12:22 +07:00
|
|
|
/*
|
|
|
|
* radix tree that keeps track of delayed nodes of every inode,
|
|
|
|
* protected by inode_lock
|
|
|
|
*/
|
|
|
|
struct radix_tree_root delayed_nodes_tree;
|
2008-11-18 08:42:26 +07:00
|
|
|
/*
|
|
|
|
* right now this just gets used so that a root has its own devid
|
|
|
|
* for stat. It may be used for more later
|
|
|
|
*/
|
2011-07-08 02:44:25 +07:00
|
|
|
dev_t anon_dev;
|
2011-11-15 08:48:06 +07:00
|
|
|
|
2012-12-07 16:28:54 +07:00
|
|
|
spinlock_t root_item_lock;
|
2013-05-15 14:48:20 +07:00
|
|
|
atomic_t refs;
|
2013-05-15 14:48:22 +07:00
|
|
|
|
2014-03-06 12:55:03 +07:00
|
|
|
struct mutex delalloc_mutex;
|
2013-05-15 14:48:22 +07:00
|
|
|
spinlock_t delalloc_lock;
|
|
|
|
/*
|
|
|
|
* all of the inodes that have delalloc bytes. It is possible for
|
|
|
|
* this list to be empty even when there is still dirty data=ordered
|
|
|
|
* extents waiting to finish IO.
|
|
|
|
*/
|
|
|
|
struct list_head delalloc_inodes;
|
|
|
|
struct list_head delalloc_root;
|
|
|
|
u64 nr_delalloc_inodes;
|
2014-03-06 12:55:02 +07:00
|
|
|
|
|
|
|
struct mutex ordered_extent_mutex;
|
2013-05-15 14:48:23 +07:00
|
|
|
/*
|
|
|
|
* this is used by the balancing code to wait for all the pending
|
|
|
|
* ordered extents
|
|
|
|
*/
|
|
|
|
spinlock_t ordered_extent_lock;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* all of the data=ordered extents pending writeback
|
|
|
|
* these can span multiple transactions and basically include
|
|
|
|
* every dirty data page that isn't from nodatacow
|
|
|
|
*/
|
|
|
|
struct list_head ordered_extents;
|
|
|
|
struct list_head ordered_root;
|
|
|
|
u64 nr_ordered_extents;
|
2013-12-16 23:34:17 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Number of currently running SEND ioctls to prevent
|
|
|
|
* manipulation with the read-only status via SUBVOL_SETFLAGS
|
|
|
|
*/
|
|
|
|
int send_in_progress;
|
2014-03-06 12:38:19 +07:00
|
|
|
struct btrfs_subvolume_writers *subv_writers;
|
|
|
|
atomic_t will_be_snapshoted;
|
2015-09-08 16:08:38 +07:00
|
|
|
|
|
|
|
/* For qgroup metadata space reserve */
|
|
|
|
atomic_t qgroup_meta_rsv;
|
2007-03-15 23:56:47 +07:00
|
|
|
};
|
|
|
|
|
2011-05-25 02:35:30 +07:00
|
|
|
struct btrfs_ioctl_defrag_range_args {
|
|
|
|
/* start of the defrag operation */
|
|
|
|
__u64 start;
|
|
|
|
|
|
|
|
/* number of bytes to defrag, use (u64)-1 to say all */
|
|
|
|
__u64 len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* flags for the operation, which can include turning
|
|
|
|
* on compression for this one defrag
|
|
|
|
*/
|
|
|
|
__u64 flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* any extent bigger than this will be considered
|
|
|
|
* already defragged. Use 0 to take the kernel default
|
|
|
|
* Use 1 to say every single extent must be rewritten
|
|
|
|
*/
|
|
|
|
__u32 extent_thresh;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* which compression method to use if turning on compression
|
|
|
|
* for this defrag operation. If unspecified, zlib will
|
|
|
|
* be used
|
|
|
|
*/
|
|
|
|
__u32 compress_type;
|
|
|
|
|
|
|
|
/* spare for later */
|
|
|
|
__u32 unused[4];
|
|
|
|
};
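As a rough illustration of how the fields of this struct fit together, the sketch below fills in a request that covers a whole file. It uses a local mirror of the struct with fixed-width types so that it builds outside the kernel; the mirror's name, the helper, and the compress flag value are assumptions made for this example, not definitions taken from this header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct above, for illustration only. */
struct defrag_range_args_sketch {
	uint64_t start;
	uint64_t len;
	uint64_t flags;
	uint32_t extent_thresh;
	uint32_t compress_type;
	uint32_t unused[4];
};

/* Three u64 fields, two u32 fields and four spare u32 slots: 48 bytes. */
static_assert(sizeof(struct defrag_range_args_sketch) == 48,
	      "unexpected padding in the argument struct");

/* Assumed flag bit meaning "compress while defragmenting". */
#define DEFRAG_RANGE_COMPRESS_SKETCH 1ULL

/* Build a request that defragments an entire file. */
static struct defrag_range_args_sketch defrag_whole_file(int compress)
{
	struct defrag_range_args_sketch args;

	memset(&args, 0, sizeof(args)); /* spare fields must stay zero */
	args.start = 0;
	args.len = (uint64_t)-1;        /* "all", per the comment on len */
	args.extent_thresh = 0;         /* take the kernel default */
	if (compress)
		args.flags |= DEFRAG_RANGE_COMPRESS_SKETCH;
	return args;
}
```

Zeroing the whole struct first keeps the spare fields at zero, so a later kernel that assigns meaning to them would not see garbage.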
|
|
|
|
|
|
|
|
|
2007-03-16 06:03:33 +07:00
|
|
|
/*
|
|
|
|
* inode items have the data typically returned from stat and store other
|
|
|
|
* info about object characteristics. There is one for every file and dir in
|
|
|
|
* the FS
|
|
|
|
*/
|
2007-04-27 03:46:15 +07:00
|
|
|
#define BTRFS_INODE_ITEM_KEY 1
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_INODE_REF_KEY 12
|
2012-08-09 01:32:27 +07:00
|
|
|
#define BTRFS_INODE_EXTREF_KEY 13
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_XATTR_ITEM_KEY 24
|
|
|
|
#define BTRFS_ORPHAN_ITEM_KEY 48
|
2007-04-27 03:46:15 +07:00
|
|
|
/* reserve 2-15 close to the inode for later flexibility */
|
2007-03-16 06:03:33 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* dir items are the name -> inode pointers in a directory. There is one
|
|
|
|
* for every name in a directory.
|
|
|
|
*/
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_DIR_LOG_ITEM_KEY 60
|
|
|
|
#define BTRFS_DIR_LOG_INDEX_KEY 72
|
|
|
|
#define BTRFS_DIR_ITEM_KEY 84
|
|
|
|
#define BTRFS_DIR_INDEX_KEY 96
|
2007-03-16 06:03:33 +07:00
|
|
|
/*
|
2007-04-27 03:46:15 +07:00
|
|
|
* extent data is for file data
|
2007-03-16 06:03:33 +07:00
|
|
|
*/
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_EXTENT_DATA_KEY 108
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 04:58:54 +07:00
|
|
|
|
2007-03-30 02:15:27 +07:00
|
|
|
/*
|
2008-12-09 04:58:54 +07:00
|
|
|
* extent csums are stored in a separate tree and hold csums for
|
|
|
|
* an entire extent on disk.
|
2007-03-30 02:15:27 +07:00
|
|
|
*/
|
2008-12-09 04:58:54 +07:00
|
|
|
#define BTRFS_EXTENT_CSUM_KEY 128
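The checksum commit message describes keys with a fixed objectid and an offset set to the starting byte of the extent. A minimal sketch of such a key, and of the (objectid, type, offset) ordering that groups all checksum items together, might look like this. The struct layout, the helper names and the objectid value are assumptions for illustration; the message only says the objectid is a magic value in ctree.h.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_EXTENT_CSUM_KEY 128

/* Assumed magic objectid, chosen here only for the sketch. */
#define CSUM_OBJECTID_SKETCH ((uint64_t)-10)

/* Simplified in-memory key; not the packed on-disk layout. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Key of the checksum item covering the extent that starts at this byte. */
static struct key_sketch csum_key_for_extent(uint64_t extent_start)
{
	struct key_sketch key = {
		.objectid = CSUM_OBJECTID_SKETCH,
		.type = BTRFS_EXTENT_CSUM_KEY,
		.offset = extent_start,
	};
	return key;
}

/* Compare keys by objectid, then type, then offset. */
static int key_cmp(const struct key_sketch *a, const struct key_sketch *b)
{
	assert(a && b);
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because objectid and type are the same for every checksum item, the items sort purely by extent start, and the key carries no reference to any subvolume or inode.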
|
2007-03-30 02:15:27 +07:00
|
|
|
|
2007-03-16 06:03:33 +07:00
|
|
|
/*
|
2009-04-03 03:46:06 +07:00
|
|
|
* root items point to tree roots. They are typically in the root
|
2007-03-16 06:03:33 +07:00
|
|
|
* tree used by the super block to find all the other trees
|
|
|
|
*/
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_ROOT_ITEM_KEY 132
|
|
|
|
|
|
|
|
/*
|
|
|
|
* root backrefs tie subvols and snapshots to the directory entries that
|
|
|
|
* reference them
|
|
|
|
*/
|
|
|
|
#define BTRFS_ROOT_BACKREF_KEY 144
|
|
|
|
|
|
|
|
/*
|
|
|
|
* root refs make a fast index for listing all of the snapshots and
|
|
|
|
* subvolumes referenced by a given root. They point directly to the
|
|
|
|
* directory item in the root that references the subvol
|
|
|
|
*/
|
|
|
|
#define BTRFS_ROOT_REF_KEY 156
|
|
|
|
|
2007-03-16 06:03:33 +07:00
|
|
|
/*
|
|
|
|
* extent items are in the extent map tree. These record which blocks
|
|
|
|
* are used, and how many references there are to each block
|
|
|
|
*/
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_EXTENT_ITEM_KEY 168
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
|
2013-03-08 02:22:04 +07:00
|
|
|
/*
|
|
|
|
* The same as the BTRFS_EXTENT_ITEM_KEY, except it's metadata we already know
|
|
|
|
* the length, so we save the level in key->offset instead of the length.
|
|
|
|
*/
|
|
|
|
#define BTRFS_METADATA_ITEM_KEY 169
|
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
#define BTRFS_TREE_BLOCK_REF_KEY 176
|
|
|
|
|
|
|
|
#define BTRFS_EXTENT_DATA_REF_KEY 178
|
|
|
|
|
|
|
|
#define BTRFS_EXTENT_REF_V0_KEY 180
|
|
|
|
|
|
|
|
#define BTRFS_SHARED_BLOCK_REF_KEY 182
|
|
|
|
|
|
|
|
#define BTRFS_SHARED_DATA_REF_KEY 184
|
2007-04-27 03:46:15 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* block groups give us hints into the extent allocation trees. Which
|
|
|
|
* blocks are free etc etc
|
|
|
|
*/
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
|
2007-03-21 01:38:32 +07:00
|
|
|
|
2008-11-18 08:37:39 +07:00
|
|
|
#define BTRFS_DEV_EXTENT_KEY 204
|
|
|
|
#define BTRFS_DEV_ITEM_KEY 216
|
|
|
|
#define BTRFS_CHUNK_ITEM_KEY 228
|
2008-03-25 02:01:56 +07:00
|
|
|
|
2011-09-13 16:06:07 +07:00
|
|
|
/*
|
|
|
|
* Records the overall state of the qgroups.
|
|
|
|
* There's only one instance of this key present,
|
|
|
|
* (0, BTRFS_QGROUP_STATUS_KEY, 0)
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_STATUS_KEY 240
|
|
|
|
/*
|
|
|
|
* Records the currently used space of the qgroup.
|
|
|
|
* One key per qgroup, (0, BTRFS_QGROUP_INFO_KEY, qgroupid).
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_INFO_KEY 242
|
|
|
|
/*
|
|
|
|
* Contains the user configured limits for the qgroup.
|
|
|
|
* One key per qgroup, (0, BTRFS_QGROUP_LIMIT_KEY, qgroupid).
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_LIMIT_KEY 244
|
|
|
|
/*
|
|
|
|
* Records the child-parent relationship of qgroups. For
|
|
|
|
* each relation, 2 keys are present:
|
|
|
|
* (childid, BTRFS_QGROUP_RELATION_KEY, parentid)
|
|
|
|
* (parentid, BTRFS_QGROUP_RELATION_KEY, childid)
|
|
|
|
*/
|
|
|
|
#define BTRFS_QGROUP_RELATION_KEY 246
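The comment above says each child-parent relation is recorded under two keys, one per direction. A small sketch of building both keys follows; the struct and function names are hypothetical and the struct is a simplified in-memory form, not the on-disk key.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_QGROUP_RELATION_KEY 246

/* Simplified in-memory key; not the packed on-disk layout. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Fill out[0] with the child->parent key and out[1] with the reverse. */
static void qgroup_relation_keys(uint64_t childid, uint64_t parentid,
				 struct key_sketch out[2])
{
	assert(out);
	out[0] = (struct key_sketch){
		.objectid = childid,
		.type = BTRFS_QGROUP_RELATION_KEY,
		.offset = parentid,
	};
	out[1] = (struct key_sketch){
		.objectid = parentid,
		.type = BTRFS_QGROUP_RELATION_KEY,
		.offset = childid,
	};
}
```

Storing both directions lets a lookup that starts from either qgroup find its relations with a search on objectid alone.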
|
|
|
|
|
2012-01-17 03:04:48 +07:00
|
|
|
#define BTRFS_BALANCE_ITEM_KEY 248
|
|
|
|
|
2012-05-25 21:06:10 +07:00
|
|
|
/*
|
|
|
|
* Persistently stores the io stats in the device tree.
|
|
|
|
* One key for all stats, (0, BTRFS_DEV_STATS_KEY, devid).
|
|
|
|
*/
|
|
|
|
#define BTRFS_DEV_STATS_KEY 249
|
|
|
|
|
2012-11-05 23:32:20 +07:00
|
|
|
/*
|
|
|
|
* Persistently stores the device replace state in the device tree.
|
|
|
|
* The key is built like this: (0, BTRFS_DEV_REPLACE_KEY, 0).
|
|
|
|
*/
|
|
|
|
#define BTRFS_DEV_REPLACE_KEY 250
|
|
|
|
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an expensive operation today. The
algorithm even has quadratic cost (in the number of existing
subvolumes), which means that it takes minutes to send/receive a
single subvolume if 10,000 subvolumes exist. But even linear cost
would be too much, since it is wasted work. And the data structures
that allow mapping UUIDs to subvolume IDs are created every time a
btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore this commit adds kernel code that maintains data structures
in the filesystem that allow a given UUID to be looked up quickly and
the data assigned to it, such as the related subvolume ID, to be
retrieved.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 22:11:17 +07:00
|
|
|
/*
|
|
|
|
* Stores items that allow UUIDs to be quickly mapped to something else.
|
|
|
|
* These items are part of the filesystem UUID tree.
|
|
|
|
* The key is built like this:
|
|
|
|
* (UUID_upper_64_bits, BTRFS_UUID_KEY*, UUID_lower_64_bits).
|
|
|
|
*/
|
|
|
|
#if BTRFS_UUID_SIZE != 16
|
|
|
|
#error "UUID items require BTRFS_UUID_SIZE == 16!"
|
|
|
|
#endif
|
|
|
|
#define BTRFS_UUID_KEY_SUBVOL 251 /* for UUIDs assigned to subvols */
|
|
|
|
#define BTRFS_UUID_KEY_RECEIVED_SUBVOL 252 /* for UUIDs assigned to
|
|
|
|
* received subvols */
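The key layout in the comment above splits a 16-byte UUID into two 64-bit halves. One way this might be done is sketched below. The byte order (little-endian), the choice of which eight bytes form the objectid, and all names are assumptions for illustration; the comment only states that one half goes into the objectid and the other into the offset.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_UUID_SIZE 16
#define BTRFS_UUID_KEY_SUBVOL 251

/* Simplified in-memory key; not the packed on-disk layout. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Read 8 bytes as a little-endian 64-bit value (assumed byte order). */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Split a 16-byte UUID across objectid and offset of a key. */
static struct key_sketch uuid_to_key(const uint8_t uuid[BTRFS_UUID_SIZE],
				     uint8_t type)
{
	struct key_sketch key;

	assert(uuid);
	key.objectid = load_le64(uuid);
	key.type = type;
	key.offset = load_le64(uuid + 8);
	return key;
}
```

The preprocessor check on BTRFS_UUID_SIZE above exists for exactly this reason: the scheme only works if a UUID fills the two 64-bit key fields with nothing left over.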
|
|
|
|
|
2007-03-16 06:03:33 +07:00
|
|
|
/*
|
|
|
|
* string items are for debugging. They just store a short string of
|
|
|
|
* data in the FS
|
|
|
|
*/
|
2007-04-27 03:46:15 +07:00
|
|
|
#define BTRFS_STRING_ITEM_KEY 253
|
|
|
|
|
2011-06-28 22:10:37 +07:00
|
|
|
/*
|
|
|
|
* Flags for mount options.
|
|
|
|
*
|
|
|
|
* Note: don't forget to add new options to btrfs_show_options()
|
|
|
|
*/
|
2008-01-09 21:23:21 +07:00
|
|
|
#define BTRFS_MOUNT_NODATASUM (1 << 0)
|
|
|
|
#define BTRFS_MOUNT_NODATACOW (1 << 1)
|
|
|
|
#define BTRFS_MOUNT_NOBARRIER (1 << 2)
|
2008-01-18 22:54:22 +07:00
|
|
|
#define BTRFS_MOUNT_SSD (1 << 3)
|
2008-05-14 00:46:40 +07:00
|
|
|
#define BTRFS_MOUNT_DEGRADED (1 << 4)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 01:49:59 +07:00
|
|
|
#define BTRFS_MOUNT_COMPRESS (1 << 5)
|
2009-04-03 03:49:40 +07:00
|
|
|
#define BTRFS_MOUNT_NOTREELOG (1 << 6)
|
2009-04-03 03:59:01 +07:00
|
|
|
#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
|
2009-06-10 07:28:34 +07:00
|
|
|
#define BTRFS_MOUNT_SSD_SPREAD (1 << 8)
|
2009-06-10 20:51:32 +07:00
|
|
|
#define BTRFS_MOUNT_NOSSD (1 << 9)
|
2009-10-14 20:24:59 +07:00
|
|
|
#define BTRFS_MOUNT_DISCARD (1 << 10)
|
2010-01-29 04:18:15 +07:00
|
|
|
#define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11)
|
2010-06-22 01:48:16 +07:00
|
|
|
#define BTRFS_MOUNT_SPACE_CACHE (1 << 12)
|
2010-09-22 01:21:34 +07:00
|
|
|
#define BTRFS_MOUNT_CLEAR_CACHE (1 << 13)
|
2010-10-30 02:46:43 +07:00
|
|
|
#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 << 14)
|
2011-02-17 01:10:41 +07:00
|
|
|
#define BTRFS_MOUNT_ENOSPC_DEBUG (1 << 15)
|
2011-05-25 02:35:30 +07:00
|
|
|
#define BTRFS_MOUNT_AUTO_DEFRAG (1 << 16)
|
2011-06-03 20:36:29 +07:00
|
|
|
#define BTRFS_MOUNT_INODE_MAP_CACHE (1 << 17)
|
2011-11-04 02:17:42 +07:00
|
|
|
#define BTRFS_MOUNT_RECOVERY (1 << 18)
|
2012-01-17 03:04:48 +07:00
|
|
|
#define BTRFS_MOUNT_SKIP_BALANCE (1 << 19)
|
2012-01-17 03:27:58 +07:00
|
|
|
#define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20)
|
|
|
|
#define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
|
2011-10-04 10:22:31 +07:00
|
|
|
#define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
|
2013-08-15 22:11:24 +07:00
|
|
|
#define BTRFS_MOUNT_RESCAN_UUID_TREE (1 << 23)
|
2015-09-24 01:54:14 +07:00
|
|
|
#define BTRFS_MOUNT_FRAGMENT_DATA (1 << 24)
|
|
|
|
#define BTRFS_MOUNT_FRAGMENT_METADATA (1 << 25)
|
2007-12-15 03:30:32 +07:00
|
|
|
|
2013-08-01 23:14:52 +07:00
|
|
|
#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
|
2013-08-09 04:45:48 +07:00
|
|
|
#define BTRFS_DEFAULT_MAX_INLINE (8192)
|
2013-08-01 23:14:52 +07:00
|
|
|
|
2007-12-15 03:30:32 +07:00
|
|
|
#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
|
|
|
|
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
|
2013-02-21 13:32:52 +07:00
|
|
|
#define btrfs_raw_test_opt(o, opt) ((o) & BTRFS_MOUNT_##opt)
|
2007-12-15 03:30:32 +07:00
|
|
|
#define btrfs_test_opt(root, opt) ((root)->fs_info->mount_opt & \
|
|
|
|
BTRFS_MOUNT_##opt)
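The option macros above rely on token pasting: the short option name is appended to BTRFS_MOUNT_ to select the flag bit. A small userspace sketch of the same idea follows, using copies of two flag definitions and the three macros that operate on a plain value; the demo function is hypothetical.

```c
#include <assert.h>

#define BTRFS_MOUNT_SSD		(1 << 3)
#define BTRFS_MOUNT_COMPRESS	(1 << 5)

#define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
#define btrfs_raw_test_opt(o, opt)	((o) & BTRFS_MOUNT_##opt)

/* Set two options, clear one again, and return the resulting mask. */
static unsigned long demo_mount_opts(void)
{
	unsigned long mount_opt = 0;

	btrfs_set_opt(mount_opt, SSD);      /* expands to BTRFS_MOUNT_SSD */
	btrfs_set_opt(mount_opt, COMPRESS);
	assert(btrfs_raw_test_opt(mount_opt, SSD));
	btrfs_clear_opt(mount_opt, SSD);
	return mount_opt;
}
```

Note that the test macros return the masked value rather than 0 or 1, so callers should treat the result as a truth value and not compare it against 1.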
|
2014-02-05 21:26:17 +07:00
|
|
|
|
2014-04-23 18:33:33 +07:00
|
|
|
#define btrfs_set_and_info(root, opt, fmt, args...)			\
do {									\
	if (!btrfs_test_opt(root, opt))					\
		btrfs_info((root)->fs_info, fmt, ##args);		\
	btrfs_set_opt((root)->fs_info->mount_opt, opt);			\
} while (0)
|
|
|
|
|
|
|
|
#define btrfs_clear_and_info(root, opt, fmt, args...)			\
do {									\
	if (btrfs_test_opt(root, opt))					\
		btrfs_info((root)->fs_info, fmt, ##args);		\
	btrfs_clear_opt((root)->fs_info->mount_opt, opt);		\
} while (0)
|
|
|
|
|
2015-09-24 01:54:14 +07:00
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
static inline int
|
|
|
|
btrfs_should_fragment_free_space(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *block_group)
|
|
|
|
{
|
|
|
|
return (btrfs_test_opt(root, FRAGMENT_METADATA) &&
|
|
|
|
block_group->flags & BTRFS_BLOCK_GROUP_METADATA) ||
|
|
|
|
(btrfs_test_opt(root, FRAGMENT_DATA) &&
|
|
|
|
block_group->flags & BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2014-02-05 21:26:17 +07:00
|
|
|
/*
|
|
|
|
* Requests for changes that need to be done during transaction commit.
|
|
|
|
*
|
|
|
|
* Internal mount options that are used for special handling of the real
|
|
|
|
* mount options (e.g. cannot be set during remount and have to be set during
|
|
|
|
* transaction commit)
|
|
|
|
*/
|
|
|
|
|
2014-02-05 21:26:17 +07:00
|
|
|
#define BTRFS_PENDING_SET_INODE_MAP_CACHE (0)
|
|
|
|
#define BTRFS_PENDING_CLEAR_INODE_MAP_CACHE (1)
|
2014-11-12 20:24:35 +07:00
|
|
|
#define BTRFS_PENDING_COMMIT (2)
|
2014-02-05 21:26:17 +07:00
|
|
|
|
2014-02-05 21:26:17 +07:00
|
|
|
#define btrfs_test_pending(info, opt) \
|
|
|
|
test_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
|
|
|
|
#define btrfs_set_pending(info, opt) \
|
|
|
|
set_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
|
|
|
|
#define btrfs_clear_pending(info, opt) \
|
|
|
|
clear_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helpers for setting pending mount option changes.
|
|
|
|
*
|
|
|
|
* Expects corresponding macros
|
|
|
|
* BTRFS_PENDING_SET_ and CLEAR_ + short mount option name
|
|
|
|
*/
|
|
|
|
#define btrfs_set_pending_and_info(info, opt, fmt, args...) \
|
|
|
|
do { \
|
|
|
|
if (!btrfs_raw_test_opt((info)->mount_opt, opt)) { \
|
|
|
|
btrfs_info((info), fmt, ##args); \
|
|
|
|
btrfs_set_pending((info), SET_##opt); \
|
|
|
|
btrfs_clear_pending((info), CLEAR_##opt); \
|
|
|
|
} \
|
|
|
|
} while(0)
|
|
|
|
|
|
|
|
#define btrfs_clear_pending_and_info(info, opt, fmt, args...) \
|
|
|
|
do { \
|
|
|
|
if (btrfs_raw_test_opt((info)->mount_opt, opt)) { \
|
|
|
|
btrfs_info((info), fmt, ##args); \
|
|
|
|
btrfs_set_pending((info), CLEAR_##opt); \
|
|
|
|
btrfs_clear_pending((info), SET_##opt); \
|
|
|
|
} \
|
|
|
|
} while(0)
|
|
|
|
|
2008-01-09 03:54:37 +07:00
|
|
|
/*
|
|
|
|
* Inode flags
|
|
|
|
*/
|
2008-01-15 01:26:08 +07:00
|
|
|
#define BTRFS_INODE_NODATASUM (1 << 0)
|
|
|
|
#define BTRFS_INODE_NODATACOW (1 << 1)
|
|
|
|
#define BTRFS_INODE_READONLY (1 << 2)
|
Btrfs: Add zlib compression support

This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64 sized compressed extents.

In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the CPUs on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 01:49:59 +07:00
|
|
|
#define BTRFS_INODE_NOCOMPRESS (1 << 3)
|
2008-10-31 01:25:28 +07:00
|
|
|
#define BTRFS_INODE_PREALLOC (1 << 4)
|
2009-04-17 15:37:41 +07:00
|
|
|
#define BTRFS_INODE_SYNC (1 << 5)
|
|
|
|
#define BTRFS_INODE_IMMUTABLE (1 << 6)
|
|
|
|
#define BTRFS_INODE_APPEND (1 << 7)
|
|
|
|
#define BTRFS_INODE_NODUMP (1 << 8)
|
|
|
|
#define BTRFS_INODE_NOATIME (1 << 9)
|
|
|
|
#define BTRFS_INODE_DIRSYNC (1 << 10)
|
Btrfs: Per file/directory controls for COW and compression

Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super. However, before this, we should
wait until that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.

After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control a file's or directory's datacow and compression attributes.

NOTE:
- The compression type is selected by these rules:
  If we mount btrfs with a compress option, i.e. zlib/lzo, that is the type.
  Otherwise, we'll use the default compress type (zlib today).

v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, this NOCOW
  would be broken by inheritance from the parent directory.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-22 17:12:20 +07:00
|
|
|
#define BTRFS_INODE_COMPRESS (1 << 11)
|
2009-04-17 15:37:41 +07:00
|
|
|
|
2011-03-28 09:01:25 +07:00
|
|
|
#define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
|
|
|
|
|
2012-03-03 19:40:03 +07:00
|
|
|
struct btrfs_map_token {
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
char *kaddr;
|
|
|
|
unsigned long offset;
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void btrfs_init_map_token(struct btrfs_map_token *token)
|
|
|
|
{
|
2012-10-16 00:39:33 +07:00
|
|
|
token->kaddr = NULL;
|
2012-03-03 19:40:03 +07:00
|
|
|
}
|
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
/* some macros to generate set/get funcs for the struct fields. This
|
|
|
|
* assumes there is a lefoo_to_cpu for every type, so let's make a simple
|
|
|
|
* one for u8:
|
|
|
|
*/
|
|
|
|
#define le8_to_cpu(v) (v)
|
|
|
|
#define cpu_to_le8(v) (v)
|
|
|
|
#define __le8 u8
|
|
|
|
|
|
|
|
#define read_eb_member(eb, ptr, type, member, result) ( \
|
|
|
|
read_extent_buffer(eb, (char *)(result), \
|
|
|
|
((unsigned long)(ptr)) + \
|
|
|
|
offsetof(type, member), \
|
|
|
|
sizeof(((type *)0)->member)))
|
|
|
|
|
|
|
|
#define write_eb_member(eb, ptr, type, member, result) ( \
|
|
|
|
write_extent_buffer(eb, (char *)(result), \
|
|
|
|
((unsigned long)(ptr)) + \
|
|
|
|
offsetof(type, member), \
|
|
|
|
sizeof(((type *)0)->member)))
|
|
|
|
|
2012-07-10 09:22:35 +07:00
|
|
|
#define DECLARE_BTRFS_SETGET_BITS(bits) \
|
|
|
|
u##bits btrfs_get_token_##bits(struct extent_buffer *eb, void *ptr, \
|
|
|
|
unsigned long off, \
|
|
|
|
struct btrfs_map_token *token); \
|
|
|
|
void btrfs_set_token_##bits(struct extent_buffer *eb, void *ptr, \
|
|
|
|
unsigned long off, u##bits val, \
|
|
|
|
struct btrfs_map_token *token); \
|
|
|
|
static inline u##bits btrfs_get_##bits(struct extent_buffer *eb, void *ptr, \
|
|
|
|
unsigned long off) \
|
|
|
|
{ \
|
|
|
|
return btrfs_get_token_##bits(eb, ptr, off, NULL); \
|
|
|
|
} \
|
|
|
|
static inline void btrfs_set_##bits(struct extent_buffer *eb, void *ptr, \
|
|
|
|
unsigned long off, u##bits val) \
|
|
|
|
{ \
|
|
|
|
btrfs_set_token_##bits(eb, ptr, off, val, NULL); \
|
|
|
|
}
|
|
|
|
|
|
|
|
DECLARE_BTRFS_SETGET_BITS(8)
|
|
|
|
DECLARE_BTRFS_SETGET_BITS(16)
|
|
|
|
DECLARE_BTRFS_SETGET_BITS(32)
|
|
|
|
DECLARE_BTRFS_SETGET_BITS(64)
|
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
#define BTRFS_SETGET_FUNCS(name, type, member, bits) \
|
2012-07-10 09:22:35 +07:00
|
|
|
static inline u##bits btrfs_##name(struct extent_buffer *eb, type *s) \
|
|
|
|
{ \
|
|
|
|
BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0)->member)); \
|
|
|
|
return btrfs_get_##bits(eb, s, offsetof(type, member)); \
|
|
|
|
} \
|
|
|
|
static inline void btrfs_set_##name(struct extent_buffer *eb, type *s, \
|
|
|
|
u##bits val) \
|
|
|
|
{ \
|
|
|
|
BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0)->member)); \
|
|
|
|
btrfs_set_##bits(eb, s, offsetof(type, member), val); \
|
|
|
|
} \
|
|
|
|
static inline u##bits btrfs_token_##name(struct extent_buffer *eb, type *s, \
|
|
|
|
struct btrfs_map_token *token) \
|
|
|
|
{ \
|
|
|
|
BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0)->member)); \
|
|
|
|
return btrfs_get_token_##bits(eb, s, offsetof(type, member), token); \
|
|
|
|
} \
|
|
|
|
static inline void btrfs_set_token_##name(struct extent_buffer *eb, \
|
|
|
|
type *s, u##bits val, \
|
|
|
|
struct btrfs_map_token *token) \
|
|
|
|
{ \
|
|
|
|
BUILD_BUG_ON(sizeof(u##bits) != sizeof(((type *)0)->member)); \
|
|
|
|
btrfs_set_token_##bits(eb, s, offsetof(type, member), val, token); \
|
|
|
|
}
|
2007-10-16 03:14:19 +07:00
|
|
|
|
|
|
|
#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
|
|
|
|
static inline u##bits btrfs_##name(struct extent_buffer *eb) \
|
|
|
|
{ \
|
2010-08-07 00:21:20 +07:00
|
|
|
type *p = page_address(eb->pages[0]); \
|
2008-02-15 22:40:52 +07:00
|
|
|
u##bits res = le##bits##_to_cpu(p->member); \
|
2007-10-16 03:18:55 +07:00
|
|
|
return res; \
|
2007-10-16 03:14:19 +07:00
|
|
|
} \
|
|
|
|
static inline void btrfs_set_##name(struct extent_buffer *eb, \
|
|
|
|
u##bits val) \
|
|
|
|
{ \
|
2010-08-07 00:21:20 +07:00
|
|
|
type *p = page_address(eb->pages[0]); \
|
2008-02-15 22:40:52 +07:00
|
|
|
p->member = cpu_to_le##bits(val); \
|
2007-10-16 03:14:19 +07:00
|
|
|
}
|
2007-04-27 03:46:15 +07:00
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
|
|
|
|
static inline u##bits btrfs_##name(type *s) \
|
|
|
|
{ \
|
|
|
|
return le##bits##_to_cpu(s->member); \
|
|
|
|
} \
|
|
|
|
static inline void btrfs_set_##name(type *s, u##bits val) \
|
|
|
|
{ \
|
|
|
|
s->member = cpu_to_le##bits(val); \
|
2007-03-16 06:03:33 +07:00
|
|
|
}
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_total_bytes, struct btrfs_dev_item, total_bytes, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
|
2008-12-09 04:40:21 +07:00
|
|
|
BTRFS_SETGET_FUNCS(device_start_offset, struct btrfs_dev_item,
|
|
|
|
start_offset, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(device_sector_size, struct btrfs_dev_item, sector_size, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_id, struct btrfs_dev_item, devid, 64);
|
2008-04-16 02:41:47 +07:00
|
|
|
BTRFS_SETGET_FUNCS(device_group, struct btrfs_dev_item, dev_group, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
|
2008-11-18 09:11:30 +07:00
|
|
|
BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
|
2008-03-25 02:02:07 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
|
|
|
|
total_bytes, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
|
|
|
|
bytes_used, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_io_align, struct btrfs_dev_item,
|
|
|
|
io_align, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_io_width, struct btrfs_dev_item,
|
|
|
|
io_width, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_sector_size, struct btrfs_dev_item,
|
|
|
|
sector_size, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_id, struct btrfs_dev_item, devid, 64);
|
2008-04-16 02:41:47 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_group, struct btrfs_dev_item,
|
|
|
|
dev_group, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_seek_speed, struct btrfs_dev_item,
|
|
|
|
seek_speed, 8);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_bandwidth, struct btrfs_dev_item,
|
|
|
|
bandwidth, 8);
|
2008-11-18 09:11:30 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_generation, struct btrfs_dev_item,
|
|
|
|
generation, 64);
|
2008-03-25 02:02:07 +07:00
|
|
|
|
2013-08-20 18:20:11 +07:00
|
|
|
static inline unsigned long btrfs_device_uuid(struct btrfs_dev_item *d)
|
2008-03-25 02:01:56 +07:00
|
|
|
{
|
2013-08-20 18:20:11 +07:00
|
|
|
return (unsigned long)d + offsetof(struct btrfs_dev_item, uuid);
|
2008-03-25 02:01:56 +07:00
|
|
|
}
|
|
|
|
|
2013-08-20 18:20:12 +07:00
|
|
|
static inline unsigned long btrfs_device_fsid(struct btrfs_dev_item *d)
|
2008-11-18 09:11:30 +07:00
|
|
|
{
|
2013-08-20 18:20:12 +07:00
|
|
|
return (unsigned long)d + offsetof(struct btrfs_dev_item, fsid);
|
2008-11-18 09:11:30 +07:00
|
|
|
}
|
|
|
|
|
2008-04-16 02:41:47 +07:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_length, struct btrfs_chunk, length, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_owner, struct btrfs_chunk, owner, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_stripe_len, struct btrfs_chunk, stripe_len, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_io_align, struct btrfs_chunk, io_align, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_io_width, struct btrfs_chunk, io_width, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_sector_size, struct btrfs_chunk, sector_size, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_type, struct btrfs_chunk, type, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_num_stripes, struct btrfs_chunk, num_stripes, 16);
|
2008-04-16 21:49:51 +07:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_sub_stripes, struct btrfs_chunk, sub_stripes, 16);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(stripe_devid, struct btrfs_stripe, devid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(stripe_offset, struct btrfs_stripe, offset, 64);
|
|
|
|
|
2008-04-16 02:41:47 +07:00
|
|
|
static inline char *btrfs_stripe_dev_uuid(struct btrfs_stripe *s)
|
|
|
|
{
|
|
|
|
return (char *)s + offsetof(struct btrfs_stripe, dev_uuid);
|
|
|
|
}
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_length, struct btrfs_chunk, length, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_owner, struct btrfs_chunk, owner, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_stripe_len, struct btrfs_chunk,
|
|
|
|
stripe_len, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_align, struct btrfs_chunk,
|
|
|
|
io_align, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_width, struct btrfs_chunk,
|
|
|
|
io_width, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_sector_size, struct btrfs_chunk,
|
|
|
|
sector_size, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_type, struct btrfs_chunk, type, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_num_stripes, struct btrfs_chunk,
|
|
|
|
num_stripes, 16);
|
2008-04-16 21:49:51 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_sub_stripes, struct btrfs_chunk,
|
|
|
|
sub_stripes, 16);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_stripe_devid, struct btrfs_stripe, devid, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_stripe_offset, struct btrfs_stripe, offset, 64);
|
|
|
|
|
|
|
|
static inline struct btrfs_stripe *btrfs_stripe_nr(struct btrfs_chunk *c,
|
|
|
|
int nr)
|
|
|
|
{
|
|
|
|
unsigned long offset = (unsigned long)c;
|
|
|
|
offset += offsetof(struct btrfs_chunk, stripe);
|
|
|
|
offset += nr * sizeof(struct btrfs_stripe);
|
|
|
|
return (struct btrfs_stripe *)offset;
|
|
|
|
}
|
|
|
|
|
2008-04-18 21:29:38 +07:00
|
|
|
static inline char *btrfs_stripe_dev_uuid_nr(struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_dev_uuid(btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
static inline u64 btrfs_stripe_offset_nr(struct extent_buffer *eb,
|
|
|
|
struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_offset(eb, btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u64 btrfs_stripe_devid_nr(struct extent_buffer *eb,
|
|
|
|
struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_devid(eb, btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
/* struct btrfs_block_group_item */
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_used, struct btrfs_block_group_item,
|
|
|
|
used, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_used, struct btrfs_block_group_item,
|
|
|
|
used, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_chunk_objectid,
|
|
|
|
struct btrfs_block_group_item, chunk_objectid, 64);
|
2008-04-16 02:41:47 +07:00
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_chunk_objectid,
|
2008-03-25 02:01:56 +07:00
|
|
|
struct btrfs_block_group_item, chunk_objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_flags,
|
|
|
|
struct btrfs_block_group_item, flags, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_flags,
|
|
|
|
struct btrfs_block_group_item, flags, 64);
|
2007-03-16 06:03:33 +07:00
|
|
|
|
2007-12-13 02:38:19 +07:00
|
|
|
/* struct btrfs_inode_ref */
|
|
|
|
BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
|
2008-07-24 23:12:38 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
|
2007-12-13 02:38:19 +07:00
|
|
|
|
2012-08-09 01:32:27 +07:00
|
|
|
/* struct btrfs_inode_extref */
|
|
|
|
BTRFS_SETGET_FUNCS(inode_extref_parent, struct btrfs_inode_extref,
|
|
|
|
parent_objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_extref_name_len, struct btrfs_inode_extref,
|
|
|
|
name_len, 16);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_extref_index, struct btrfs_inode_extref, index, 64);
|
|
|
|
|
2007-10-16 03:14:19 +07:00
|
|
|
/* struct btrfs_inode_item */
|
|
|
|
BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 64);
|
2008-12-09 04:40:21 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64);
|
2008-09-06 03:13:11 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_transid, struct btrfs_inode_item, transid, 64);
|
2007-10-16 03:14:19 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
|
2008-10-09 22:46:29 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_nbytes, struct btrfs_inode_item, nbytes, 64);
|
2007-10-16 03:14:19 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_block_group, struct btrfs_inode_item, block_group, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_nlink, struct btrfs_inode_item, nlink, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_uid, struct btrfs_inode_item, uid, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
|
2008-12-02 18:36:08 +07:00
|
|
|
BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
|
2013-07-16 10:19:18 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_generation, struct btrfs_inode_item,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_sequence, struct btrfs_inode_item,
|
|
|
|
sequence, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_transid, struct btrfs_inode_item,
|
|
|
|
transid, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_size, struct btrfs_inode_item, size, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_nbytes, struct btrfs_inode_item,
|
|
|
|
nbytes, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_block_group, struct btrfs_inode_item,
|
|
|
|
block_group, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_nlink, struct btrfs_inode_item, nlink, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_uid, struct btrfs_inode_item, uid, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
|
2013-07-16 10:19:18 +07:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
|
2007-03-22 23:13:20 +07:00
|
|
|
|
2008-03-25 02:01:56 +07:00
|
|
|
/* struct btrfs_dev_extent */
|
2008-04-16 02:41:47 +07:00
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
|
|
|
|
chunk_tree, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
|
|
|
|
chunk_objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_offset, struct btrfs_dev_extent,
|
|
|
|
chunk_offset, 64);
|
2008-03-25 02:01:56 +07:00
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_length, struct btrfs_dev_extent, length, 64);
|
|
|
|
|
2013-08-20 18:20:13 +07:00
|
|
|
static inline unsigned long btrfs_dev_extent_chunk_tree_uuid(struct btrfs_dev_extent *dev)
|
2008-04-16 02:41:47 +07:00
|
|
|
{
|
|
|
|
unsigned long ptr = offsetof(struct btrfs_dev_extent, chunk_tree_uuid);
|
2013-08-20 18:20:13 +07:00
|
|
|
return (unsigned long)dev + ptr;
|
2008-04-16 02:41:47 +07:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
BTRFS_SETGET_FUNCS(extent_refs, struct btrfs_extent_item, refs, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_generation, struct btrfs_extent_item,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_flags, struct btrfs_extent_item, flags, 64);
|
2007-12-11 21:25:06 +07:00
|
|
|
|
2009-06-10 21:45:14 +07:00
|
|
|
BTRFS_SETGET_FUNCS(extent_refs_v0, struct btrfs_extent_item_v0, refs, 32);
|
|
|
|
|
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(tree_block_level, struct btrfs_tree_block_info, level, 8);
|
|
|
|
|
|
|
|
static inline void btrfs_tree_block_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_tree_block_info *item,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
read_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_set_tree_block_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_tree_block_info *item,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
write_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
|
|
|
|
}
|
2007-03-22 23:13:20 +07:00
|
|
|
|
2009-06-10 21:45:14 +07:00
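The reference-count rule described in the commit message above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct toy_block` and `toy_cow_update_refs()` are hypothetical names invented here, not btrfs structures or functions, and the model ignores back ref items, transactions and locking entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical in-memory stand-in for a tree block; not a btrfs structure. */
struct toy_block {
	int refs;		/* reference count of this block */
	int freed;		/* set once the block has been released */
	size_t nr_children;
	struct toy_block *children[TOY_MAX_CHILDREN];
};

/*
 * Apply the reference-count rule for COWing 'old' into 'new_blk'.
 * 'new_blk' must already hold a copy of old's child pointers.
 */
static void toy_cow_update_refs(struct toy_block *old,
				struct toy_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1);
	assert(new_blk->nr_children == old->nr_children);

	if (old->refs == 1) {
		/*
		 * Non-shared block: free it at once. The children keep
		 * their counts because the new block inherits the old
		 * block's references.
		 */
		old->refs = 0;
		old->freed = 1;
		return;
	}

	/*
	 * Shared block: every extent the new block points to gains one
	 * reference, and the old block loses one.
	 */
	for (i = 0; i < new_blk->nr_children; i++)
		new_blk->children[i]->refs++;
	old->refs--;
}
```

In the non-shared case no child counter is touched at all, which is the saving the commit message attributes to avoiding dead root records.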
BTRFS_SETGET_FUNCS(extent_data_ref_root, struct btrfs_extent_data_ref,
		   root, 64);
BTRFS_SETGET_FUNCS(extent_data_ref_objectid, struct btrfs_extent_data_ref,
		   objectid, 64);
BTRFS_SETGET_FUNCS(extent_data_ref_offset, struct btrfs_extent_data_ref,
		   offset, 64);
BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref,
		   count, 32);

BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref,
		   count, 32);

BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
		   type, 8);
BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
		   offset, 64);

static inline u32 btrfs_extent_inline_ref_size(int type)
{
	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
	    type == BTRFS_SHARED_BLOCK_REF_KEY)
		return sizeof(struct btrfs_extent_inline_ref);
	if (type == BTRFS_SHARED_DATA_REF_KEY)
		return sizeof(struct btrfs_shared_data_ref) +
		       sizeof(struct btrfs_extent_inline_ref);
	if (type == BTRFS_EXTENT_DATA_REF_KEY)
		return sizeof(struct btrfs_extent_data_ref) +
		       offsetof(struct btrfs_extent_inline_ref, offset);
	BUG();
	return 0;
}
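The size computation in btrfs_extent_inline_ref_size() may be easier to follow with concrete numbers. The sketch below is a userspace approximation: the `toy_` structs, their packed layouts and the key type values are assumptions made for illustration (the real definitions live elsewhere in this header), and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key type values assumed for this sketch. */
enum {
	TOY_TREE_BLOCK_REF_KEY   = 176,
	TOY_EXTENT_DATA_REF_KEY  = 178,
	TOY_SHARED_BLOCK_REF_KEY = 182,
	TOY_SHARED_DATA_REF_KEY  = 184,
};

/* Assumed packed layouts: 9, 4 and 28 bytes respectively. */
struct toy_extent_inline_ref {
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

struct toy_shared_data_ref {
	uint32_t count;
} __attribute__((packed));

struct toy_extent_data_ref {
	uint64_t root;
	uint64_t objectid;
	uint64_t offset;
	uint32_t count;
} __attribute__((packed));

static uint32_t toy_extent_inline_ref_size(int type)
{
	switch (type) {
	case TOY_TREE_BLOCK_REF_KEY:
	case TOY_SHARED_BLOCK_REF_KEY:
		/* type byte plus 64-bit offset */
		return sizeof(struct toy_extent_inline_ref);
	case TOY_SHARED_DATA_REF_KEY:
		/* the count follows the complete inline ref */
		return sizeof(struct toy_shared_data_ref) +
		       sizeof(struct toy_extent_inline_ref);
	case TOY_EXTENT_DATA_REF_KEY:
		/* the data ref starts where the offset field would be */
		return sizeof(struct toy_extent_data_ref) +
		       offsetof(struct toy_extent_inline_ref, offset);
	default:
		/* stands in for BUG() in the kernel version */
		assert(0 && "unknown inline ref type");
		return 0;
	}
}
```

Under these assumed layouts, the extent data ref case counts only the type byte of the inline ref, because the data ref body occupies the space where the `offset` field would otherwise sit.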

BTRFS_SETGET_FUNCS(ref_root_v0, struct btrfs_extent_ref_v0, root, 64);
BTRFS_SETGET_FUNCS(ref_generation_v0, struct btrfs_extent_ref_v0,
		   generation, 64);
BTRFS_SETGET_FUNCS(ref_objectid_v0, struct btrfs_extent_ref_v0, objectid, 64);
BTRFS_SETGET_FUNCS(ref_count_v0, struct btrfs_extent_ref_v0, count, 32);

/* struct btrfs_node */
BTRFS_SETGET_FUNCS(key_blockptr, struct btrfs_key_ptr, blockptr, 64);
BTRFS_SETGET_FUNCS(key_generation, struct btrfs_key_ptr, generation, 64);
BTRFS_SETGET_STACK_FUNCS(stack_key_blockptr, struct btrfs_key_ptr,
			 blockptr, 64);
BTRFS_SETGET_STACK_FUNCS(stack_key_generation, struct btrfs_key_ptr,
			 generation, 64);

static inline u64 btrfs_node_blockptr(struct extent_buffer *eb, int nr)
{
	unsigned long ptr;
	ptr = offsetof(struct btrfs_node, ptrs) +
		sizeof(struct btrfs_key_ptr) * nr;
	return btrfs_key_blockptr(eb, (struct btrfs_key_ptr *)ptr);
}

static inline void btrfs_set_node_blockptr(struct extent_buffer *eb,
					   int nr, u64 val)
{
	unsigned long ptr;
	ptr = offsetof(struct btrfs_node, ptrs) +
		sizeof(struct btrfs_key_ptr) * nr;
	btrfs_set_key_blockptr(eb, (struct btrfs_key_ptr *)ptr, val);
}

static inline u64 btrfs_node_ptr_generation(struct extent_buffer *eb, int nr)
{
	unsigned long ptr;
	ptr = offsetof(struct btrfs_node, ptrs) +
		sizeof(struct btrfs_key_ptr) * nr;
	return btrfs_key_generation(eb, (struct btrfs_key_ptr *)ptr);
}

static inline void btrfs_set_node_ptr_generation(struct extent_buffer *eb,
						 int nr, u64 val)
{
	unsigned long ptr;
	ptr = offsetof(struct btrfs_node, ptrs) +
		sizeof(struct btrfs_key_ptr) * nr;
	btrfs_set_key_generation(eb, (struct btrfs_key_ptr *)ptr, val);
}

static inline unsigned long btrfs_node_key_ptr_offset(int nr)
{
	return offsetof(struct btrfs_node, ptrs) +
		sizeof(struct btrfs_key_ptr) * nr;
}

void btrfs_node_key(struct extent_buffer *eb,
		    struct btrfs_disk_key *disk_key, int nr);

static inline void btrfs_set_node_key(struct extent_buffer *eb,
				      struct btrfs_disk_key *disk_key, int nr)
{
	unsigned long ptr;
	ptr = btrfs_node_key_ptr_offset(nr);
	write_eb_member(eb, (struct btrfs_key_ptr *)ptr,
			struct btrfs_key_ptr, key, disk_key);
}

/* struct btrfs_item */
BTRFS_SETGET_FUNCS(item_offset, struct btrfs_item, offset, 32);
BTRFS_SETGET_FUNCS(item_size, struct btrfs_item, size, 32);
BTRFS_SETGET_STACK_FUNCS(stack_item_offset, struct btrfs_item, offset, 32);
BTRFS_SETGET_STACK_FUNCS(stack_item_size, struct btrfs_item, size, 32);

static inline unsigned long btrfs_item_nr_offset(int nr)
{
	return offsetof(struct btrfs_leaf, items) +
		sizeof(struct btrfs_item) * nr;
}

static inline struct btrfs_item *btrfs_item_nr(int nr)
{
	return (struct btrfs_item *)btrfs_item_nr_offset(nr);
}

static inline u32 btrfs_item_end(struct extent_buffer *eb,
				 struct btrfs_item *item)
{
	return btrfs_item_offset(eb, item) + btrfs_item_size(eb, item);
}
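Note that btrfs_item_nr() returns an offset into the extent buffer cast to a pointer type, not an address that can be dereferenced; the accessors then read the fields at that offset. A minimal userspace sketch of the same arithmetic follows. The `toy_` types, the 101-byte header and the 17-byte key are assumptions made for illustration, and a plain byte array stands in for the extent buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 17-byte key, then little-endian offset and size. */
struct toy_item {
	uint8_t key[17];
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

struct toy_leaf {
	uint8_t header[101];	/* assumed header size */
	struct toy_item items[];
} __attribute__((packed));

/* Read and write a little-endian u32 at byte offset 'off' of 'buf'. */
static uint32_t toy_get_le32(const uint8_t *buf, size_t off)
{
	return (uint32_t)buf[off] |
	       (uint32_t)buf[off + 1] << 8 |
	       (uint32_t)buf[off + 2] << 16 |
	       (uint32_t)buf[off + 3] << 24;
}

static void toy_put_le32(uint8_t *buf, size_t off, uint32_t v)
{
	buf[off] = (uint8_t)v;
	buf[off + 1] = (uint8_t)(v >> 8);
	buf[off + 2] = (uint8_t)(v >> 16);
	buf[off + 3] = (uint8_t)(v >> 24);
}

/* Byte offset of item header 'nr' from the start of the leaf. */
static size_t toy_item_nr_offset(int nr)
{
	assert(nr >= 0);
	return offsetof(struct toy_leaf, items) +
	       sizeof(struct toy_item) * (size_t)nr;
}

/* offset + size of item 'nr', both read from the leaf buffer. */
static uint32_t toy_item_end_nr(const uint8_t *leaf, int nr)
{
	size_t base = toy_item_nr_offset(nr);
	uint32_t off = toy_get_le32(leaf,
				    base + offsetof(struct toy_item, offset));
	uint32_t size = toy_get_le32(leaf,
				     base + offsetof(struct toy_item, size));

	return off + size;
}
```

Working with offsets rather than pointers keeps the helpers independent of where the buffer's pages happen to be mapped, which appears to be the reason the header casts offsets to struct pointers.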

static inline u32 btrfs_item_end_nr(struct extent_buffer *eb, int nr)
{
	return btrfs_item_end(eb, btrfs_item_nr(nr));
}

static inline u32 btrfs_item_offset_nr(struct extent_buffer *eb, int nr)
{
	return btrfs_item_offset(eb, btrfs_item_nr(nr));
}

static inline u32 btrfs_item_size_nr(struct extent_buffer *eb, int nr)
{
	return btrfs_item_size(eb, btrfs_item_nr(nr));
}

static inline void btrfs_item_key(struct extent_buffer *eb,
				  struct btrfs_disk_key *disk_key, int nr)
{
	struct btrfs_item *item = btrfs_item_nr(nr);
	read_eb_member(eb, item, struct btrfs_item, key, disk_key);
}

static inline void btrfs_set_item_key(struct extent_buffer *eb,
				      struct btrfs_disk_key *disk_key, int nr)
{
	struct btrfs_item *item = btrfs_item_nr(nr);
	write_eb_member(eb, item, struct btrfs_item, key, disk_key);
}

BTRFS_SETGET_FUNCS(dir_log_end, struct btrfs_dir_log_item, end, 64);

/*
 * struct btrfs_root_ref
 */
BTRFS_SETGET_FUNCS(root_ref_dirid, struct btrfs_root_ref, dirid, 64);
BTRFS_SETGET_FUNCS(root_ref_sequence, struct btrfs_root_ref, sequence, 64);
BTRFS_SETGET_FUNCS(root_ref_name_len, struct btrfs_root_ref, name_len, 16);

/* struct btrfs_dir_item */
BTRFS_SETGET_FUNCS(dir_data_len, struct btrfs_dir_item, data_len, 16);
BTRFS_SETGET_FUNCS(dir_type, struct btrfs_dir_item, type, 8);
BTRFS_SETGET_FUNCS(dir_name_len, struct btrfs_dir_item, name_len, 16);
BTRFS_SETGET_FUNCS(dir_transid, struct btrfs_dir_item, transid, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dir_type, struct btrfs_dir_item, type, 8);
BTRFS_SETGET_STACK_FUNCS(stack_dir_data_len, struct btrfs_dir_item,
			 data_len, 16);
BTRFS_SETGET_STACK_FUNCS(stack_dir_name_len, struct btrfs_dir_item,
			 name_len, 16);
BTRFS_SETGET_STACK_FUNCS(stack_dir_transid, struct btrfs_dir_item,
			 transid, 64);

static inline void btrfs_dir_item_key(struct extent_buffer *eb,
				      struct btrfs_dir_item *item,
				      struct btrfs_disk_key *key)
{
	read_eb_member(eb, item, struct btrfs_dir_item, location, key);
}

static inline void btrfs_set_dir_item_key(struct extent_buffer *eb,
					  struct btrfs_dir_item *item,
					  struct btrfs_disk_key *key)
{
	write_eb_member(eb, item, struct btrfs_dir_item, location, key);
}

BTRFS_SETGET_FUNCS(free_space_entries, struct btrfs_free_space_header,
		   num_entries, 64);
BTRFS_SETGET_FUNCS(free_space_bitmaps, struct btrfs_free_space_header,
		   num_bitmaps, 64);
BTRFS_SETGET_FUNCS(free_space_generation, struct btrfs_free_space_header,
		   generation, 64);

static inline void btrfs_free_space_key(struct extent_buffer *eb,
					struct btrfs_free_space_header *h,
					struct btrfs_disk_key *key)
{
	read_eb_member(eb, h, struct btrfs_free_space_header, location, key);
}

static inline void btrfs_set_free_space_key(struct extent_buffer *eb,
					    struct btrfs_free_space_header *h,
					    struct btrfs_disk_key *key)
{
	write_eb_member(eb, h, struct btrfs_free_space_header, location, key);
}

/* struct btrfs_disk_key */
BTRFS_SETGET_STACK_FUNCS(disk_key_objectid, struct btrfs_disk_key,
			 objectid, 64);
BTRFS_SETGET_STACK_FUNCS(disk_key_offset, struct btrfs_disk_key, offset, 64);
BTRFS_SETGET_STACK_FUNCS(disk_key_type, struct btrfs_disk_key, type, 8);

static inline void btrfs_disk_key_to_cpu(struct btrfs_key *cpu,
					 struct btrfs_disk_key *disk)
{
	cpu->offset = le64_to_cpu(disk->offset);
	cpu->type = disk->type;
	cpu->objectid = le64_to_cpu(disk->objectid);
}

static inline void btrfs_cpu_key_to_disk(struct btrfs_disk_key *disk,
					 struct btrfs_key *cpu)
{
	disk->offset = cpu_to_le64(cpu->offset);
	disk->type = cpu->type;
	disk->objectid = cpu_to_le64(cpu->objectid);
}
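The two key conversion helpers above are inverses of each other: the on-disk key stores its 64-bit fields in little-endian order, and the single-byte type needs no conversion. A portable userspace sketch of the same round trip is shown below. The `toy_` names are hypothetical, and the disk key is modelled as explicit byte arrays (an assumption of this sketch) so the example behaves identically on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-order key, as code would use it in memory. */
struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* On-disk key modelled as raw little-endian bytes (17 bytes total). */
struct toy_disk_key {
	uint8_t objectid[8];
	uint8_t type;
	uint8_t offset[8];
};

static void toy_put_le64(uint8_t out[8], uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

static uint64_t toy_get_le64(const uint8_t in[8])
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return v;
}

static void toy_cpu_key_to_disk(struct toy_disk_key *disk,
				const struct toy_key *cpu)
{
	assert(disk && cpu);
	toy_put_le64(disk->offset, cpu->offset);
	disk->type = cpu->type;		/* one byte, no swap needed */
	toy_put_le64(disk->objectid, cpu->objectid);
}

static void toy_disk_key_to_cpu(struct toy_key *cpu,
				const struct toy_disk_key *disk)
{
	assert(disk && cpu);
	cpu->offset = toy_get_le64(disk->offset);
	cpu->type = disk->type;
	cpu->objectid = toy_get_le64(disk->objectid);
}
```

The kernel helpers achieve the same effect with `cpu_to_le64()` and `le64_to_cpu()`, which compile to nothing on little-endian machines.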

static inline void btrfs_node_key_to_cpu(struct extent_buffer *eb,
					 struct btrfs_key *key, int nr)
{
	struct btrfs_disk_key disk_key;
	btrfs_node_key(eb, &disk_key, nr);
	btrfs_disk_key_to_cpu(key, &disk_key);
}

static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
					 struct btrfs_key *key, int nr)
{
	struct btrfs_disk_key disk_key;
	btrfs_item_key(eb, &disk_key, nr);
	btrfs_disk_key_to_cpu(key, &disk_key);
}

static inline void btrfs_dir_item_key_to_cpu(struct extent_buffer *eb,
					     struct btrfs_dir_item *item,
					     struct btrfs_key *key)
{
	struct btrfs_disk_key disk_key;
	btrfs_dir_item_key(eb, item, &disk_key);
	btrfs_disk_key_to_cpu(key, &disk_key);
}

static inline u8 btrfs_key_type(struct btrfs_key *key)
{
	return key->type;
}

static inline void btrfs_set_key_type(struct btrfs_key *key, u8 val)
{
	key->type = val;
}

/* struct btrfs_header */
BTRFS_SETGET_HEADER_FUNCS(header_bytenr, struct btrfs_header, bytenr, 64);
BTRFS_SETGET_HEADER_FUNCS(header_generation, struct btrfs_header,
			  generation, 64);
BTRFS_SETGET_HEADER_FUNCS(header_owner, struct btrfs_header, owner, 64);
BTRFS_SETGET_HEADER_FUNCS(header_nritems, struct btrfs_header, nritems, 32);
BTRFS_SETGET_HEADER_FUNCS(header_flags, struct btrfs_header, flags, 64);
BTRFS_SETGET_HEADER_FUNCS(header_level, struct btrfs_header, level, 8);
BTRFS_SETGET_STACK_FUNCS(stack_header_generation, struct btrfs_header,
			 generation, 64);
BTRFS_SETGET_STACK_FUNCS(stack_header_owner, struct btrfs_header, owner, 64);
BTRFS_SETGET_STACK_FUNCS(stack_header_nritems, struct btrfs_header,
			 nritems, 32);
BTRFS_SETGET_STACK_FUNCS(stack_header_bytenr, struct btrfs_header, bytenr, 64);

static inline int btrfs_header_flag(struct extent_buffer *eb, u64 flag)
{
	return (btrfs_header_flags(eb) & flag) == flag;
}

static inline int btrfs_set_header_flag(struct extent_buffer *eb, u64 flag)
{
	u64 flags = btrfs_header_flags(eb);
	btrfs_set_header_flags(eb, flags | flag);
	return (flags & flag) == flag;
}

static inline int btrfs_clear_header_flag(struct extent_buffer *eb, u64 flag)
{
	u64 flags = btrfs_header_flags(eb);
	btrfs_set_header_flags(eb, flags & ~flag);
	return (flags & flag) == flag;
}

static inline int btrfs_header_backref_rev(struct extent_buffer *eb)
{
	u64 flags = btrfs_header_flags(eb);
	return flags >> BTRFS_BACKREF_REV_SHIFT;
}

static inline void btrfs_set_header_backref_rev(struct extent_buffer *eb,
						int rev)
{
	u64 flags = btrfs_header_flags(eb);
	flags &= ~BTRFS_BACKREF_REV_MASK;
	flags |= (u64)rev << BTRFS_BACKREF_REV_SHIFT;
	btrfs_set_header_flags(eb, flags);
}
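The back reference revision shares the 64-bit header flags word with the ordinary flag bits: it lives in the topmost bits and is set with a clear-then-or sequence so the low bits survive. The sketch below reproduces that packing on a plain integer. The `TOY_` constants (shift of 56, 256 possible revisions) are assumptions made for this example, since the real macro definitions are not part of this section.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch: revision stored in the top 8 bits. */
#define TOY_BACKREF_REV_MAX	256
#define TOY_BACKREF_REV_SHIFT	56
#define TOY_BACKREF_REV_MASK \
	(((uint64_t)TOY_BACKREF_REV_MAX - 1) << TOY_BACKREF_REV_SHIFT)

/* Extract the revision from a flags word. */
static int toy_backref_rev(uint64_t flags)
{
	return (int)(flags >> TOY_BACKREF_REV_SHIFT);
}

/* Return 'flags' with the revision field replaced by 'rev'. */
static uint64_t toy_set_backref_rev(uint64_t flags, int rev)
{
	assert(rev >= 0 && rev < TOY_BACKREF_REV_MAX);
	flags &= ~TOY_BACKREF_REV_MASK;			/* clear old revision */
	flags |= (uint64_t)rev << TOY_BACKREF_REV_SHIFT;	/* store new one */
	return flags;
}
```

The getter needs no mask only because the field occupies the topmost bits, so a right shift discards everything else.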

static inline unsigned long btrfs_header_fsid(void)
{
	return offsetof(struct btrfs_header, fsid);
}

static inline unsigned long btrfs_header_chunk_tree_uuid(struct extent_buffer *eb)
{
	return offsetof(struct btrfs_header, chunk_tree_uuid);
}

static inline int btrfs_is_leaf(struct extent_buffer *eb)
{
	return btrfs_header_level(eb) == 0;
}

/* struct btrfs_root_item */
BTRFS_SETGET_FUNCS(disk_root_generation, struct btrfs_root_item,
		   generation, 64);
BTRFS_SETGET_FUNCS(disk_root_refs, struct btrfs_root_item, refs, 32);
BTRFS_SETGET_FUNCS(disk_root_bytenr, struct btrfs_root_item, bytenr, 64);
BTRFS_SETGET_FUNCS(disk_root_level, struct btrfs_root_item, level, 8);

BTRFS_SETGET_STACK_FUNCS(root_generation, struct btrfs_root_item,
			 generation, 64);
BTRFS_SETGET_STACK_FUNCS(root_bytenr, struct btrfs_root_item, bytenr, 64);
BTRFS_SETGET_STACK_FUNCS(root_level, struct btrfs_root_item, level, 8);
BTRFS_SETGET_STACK_FUNCS(root_dirid, struct btrfs_root_item, root_dirid, 64);
BTRFS_SETGET_STACK_FUNCS(root_refs, struct btrfs_root_item, refs, 32);
BTRFS_SETGET_STACK_FUNCS(root_flags, struct btrfs_root_item, flags, 64);
BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
			 last_snapshot, 64);
BTRFS_SETGET_STACK_FUNCS(root_generation_v2, struct btrfs_root_item,
			 generation_v2, 64);
BTRFS_SETGET_STACK_FUNCS(root_ctransid, struct btrfs_root_item,
			 ctransid, 64);
BTRFS_SETGET_STACK_FUNCS(root_otransid, struct btrfs_root_item,
			 otransid, 64);
BTRFS_SETGET_STACK_FUNCS(root_stransid, struct btrfs_root_item,
			 stransid, 64);
BTRFS_SETGET_STACK_FUNCS(root_rtransid, struct btrfs_root_item,
			 rtransid, 64);

static inline bool btrfs_root_readonly(struct btrfs_root *root)
{
	return (root->root_item.flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_RDONLY)) != 0;
}
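btrfs_root_readonly() tests the flag without converting the stored field: it converts the constant to little-endian instead and masks in the on-disk byte order. That gives the same answer as converting the field first, because a byte swap only permutes bits and so commutes with a bitwise AND. The sketch below checks that equivalence in userspace. All `toy_` names are hypothetical, and the flag value `1ULL << 0` is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ROOT_SUBVOL_RDONLY (1ULL << 0)	/* assumed flag value */

/* Return a value whose in-memory bytes hold 'v' in little-endian order. */
static uint64_t toy_cpu_to_le64(uint64_t v)
{
	uint8_t b[8];
	uint64_t out;
	int i;

	for (i = 0; i < 8; i++)
		b[i] = (uint8_t)(v >> (8 * i));
	memcpy(&out, b, sizeof(out));
	return out;
}

/* Inverse of toy_cpu_to_le64(). */
static uint64_t toy_le64_to_cpu(uint64_t le)
{
	uint8_t b[8];
	uint64_t v = 0;
	int i;

	memcpy(b, &le, sizeof(b));
	for (i = 0; i < 8; i++)
		v |= (uint64_t)b[i] << (8 * i);
	return v;
}

/* Mask in the little-endian domain, as the header does. */
static bool toy_rdonly_le_domain(uint64_t le_flags)
{
	return (le_flags & toy_cpu_to_le64(TOY_ROOT_SUBVOL_RDONLY)) != 0;
}

/* Convert first, then mask in CPU byte order. */
static bool toy_rdonly_cpu_domain(uint64_t le_flags)
{
	return (toy_le64_to_cpu(le_flags) & TOY_ROOT_SUBVOL_RDONLY) != 0;
}
```

Converting the constant rather than the field lets the compiler fold the swap at build time, so the check costs a single AND even on big-endian hosts.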

static inline bool btrfs_root_dead(struct btrfs_root *root)
{
	return (root->root_item.flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_DEAD)) != 0;
}

/* struct btrfs_root_backup */
BTRFS_SETGET_STACK_FUNCS(backup_tree_root, struct btrfs_root_backup,
			 tree_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_tree_root_gen, struct btrfs_root_backup,
			 tree_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_tree_root_level, struct btrfs_root_backup,
			 tree_root_level, 8);

BTRFS_SETGET_STACK_FUNCS(backup_chunk_root, struct btrfs_root_backup,
			 chunk_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_chunk_root_gen, struct btrfs_root_backup,
			 chunk_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_chunk_root_level, struct btrfs_root_backup,
			 chunk_root_level, 8);

BTRFS_SETGET_STACK_FUNCS(backup_extent_root, struct btrfs_root_backup,
			 extent_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_extent_root_gen, struct btrfs_root_backup,
			 extent_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_extent_root_level, struct btrfs_root_backup,
			 extent_root_level, 8);

BTRFS_SETGET_STACK_FUNCS(backup_fs_root, struct btrfs_root_backup,
			 fs_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_fs_root_gen, struct btrfs_root_backup,
			 fs_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_fs_root_level, struct btrfs_root_backup,
			 fs_root_level, 8);

BTRFS_SETGET_STACK_FUNCS(backup_dev_root, struct btrfs_root_backup,
			 dev_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_dev_root_gen, struct btrfs_root_backup,
			 dev_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_dev_root_level, struct btrfs_root_backup,
			 dev_root_level, 8);

BTRFS_SETGET_STACK_FUNCS(backup_csum_root, struct btrfs_root_backup,
			 csum_root, 64);
BTRFS_SETGET_STACK_FUNCS(backup_csum_root_gen, struct btrfs_root_backup,
			 csum_root_gen, 64);
BTRFS_SETGET_STACK_FUNCS(backup_csum_root_level, struct btrfs_root_backup,
			 csum_root_level, 8);
BTRFS_SETGET_STACK_FUNCS(backup_total_bytes, struct btrfs_root_backup,
			 total_bytes, 64);
BTRFS_SETGET_STACK_FUNCS(backup_bytes_used, struct btrfs_root_backup,
			 bytes_used, 64);
BTRFS_SETGET_STACK_FUNCS(backup_num_devices, struct btrfs_root_backup,
			 num_devices, 64);

/* struct btrfs_balance_item */
BTRFS_SETGET_FUNCS(balance_flags, struct btrfs_balance_item, flags, 64);

static inline void btrfs_balance_data(struct extent_buffer *eb,
				      struct btrfs_balance_item *bi,
				      struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
}

static inline void btrfs_set_balance_data(struct extent_buffer *eb,
					  struct btrfs_balance_item *bi,
					  struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
}

static inline void btrfs_balance_meta(struct extent_buffer *eb,
				      struct btrfs_balance_item *bi,
				      struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, meta, ba);
}

static inline void btrfs_set_balance_meta(struct extent_buffer *eb,
					  struct btrfs_balance_item *bi,
					  struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, meta, ba);
}

static inline void btrfs_balance_sys(struct extent_buffer *eb,
				     struct btrfs_balance_item *bi,
				     struct btrfs_disk_balance_args *ba)
{
	read_eb_member(eb, bi, struct btrfs_balance_item, sys, ba);
}

static inline void btrfs_set_balance_sys(struct extent_buffer *eb,
					 struct btrfs_balance_item *bi,
					 struct btrfs_disk_balance_args *ba)
{
	write_eb_member(eb, bi, struct btrfs_balance_item, sys, ba);
}

static inline void
btrfs_disk_balance_args_to_cpu(struct btrfs_balance_args *cpu,
			       struct btrfs_disk_balance_args *disk)
{
	memset(cpu, 0, sizeof(*cpu));

	cpu->profiles = le64_to_cpu(disk->profiles);
	cpu->usage = le64_to_cpu(disk->usage);
	cpu->devid = le64_to_cpu(disk->devid);
	cpu->pstart = le64_to_cpu(disk->pstart);
	cpu->pend = le64_to_cpu(disk->pend);
	cpu->vstart = le64_to_cpu(disk->vstart);
	cpu->vend = le64_to_cpu(disk->vend);
	cpu->target = le64_to_cpu(disk->target);
	cpu->flags = le64_to_cpu(disk->flags);
	cpu->limit = le64_to_cpu(disk->limit);
}

static inline void
btrfs_cpu_balance_args_to_disk(struct btrfs_disk_balance_args *disk,
			       struct btrfs_balance_args *cpu)
{
	memset(disk, 0, sizeof(*disk));

	disk->profiles = cpu_to_le64(cpu->profiles);
	disk->usage = cpu_to_le64(cpu->usage);
	disk->devid = cpu_to_le64(cpu->devid);
	disk->pstart = cpu_to_le64(cpu->pstart);
	disk->pend = cpu_to_le64(cpu->pend);
	disk->vstart = cpu_to_le64(cpu->vstart);
	disk->vend = cpu_to_le64(cpu->vend);
	disk->target = cpu_to_le64(cpu->target);
	disk->flags = cpu_to_le64(cpu->flags);
	disk->limit = cpu_to_le64(cpu->limit);
}

/* struct btrfs_super_block */
BTRFS_SETGET_STACK_FUNCS(super_bytenr, struct btrfs_super_block, bytenr, 64);
BTRFS_SETGET_STACK_FUNCS(super_flags, struct btrfs_super_block, flags, 64);
BTRFS_SETGET_STACK_FUNCS(super_generation, struct btrfs_super_block,
			 generation, 64);
BTRFS_SETGET_STACK_FUNCS(super_root, struct btrfs_super_block, root, 64);
BTRFS_SETGET_STACK_FUNCS(super_sys_array_size,
			 struct btrfs_super_block, sys_chunk_array_size, 32);
BTRFS_SETGET_STACK_FUNCS(super_chunk_root_generation,
			 struct btrfs_super_block, chunk_root_generation, 64);
BTRFS_SETGET_STACK_FUNCS(super_root_level, struct btrfs_super_block,
			 root_level, 8);
BTRFS_SETGET_STACK_FUNCS(super_chunk_root, struct btrfs_super_block,
			 chunk_root, 64);
BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block,
			 chunk_root_level, 8);
BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block,
			 log_root, 64);
BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block,
			 log_root_transid, 64);
BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block,
			 log_root_level, 8);
BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block,
			 total_bytes, 64);
BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block,
			 bytes_used, 64);
BTRFS_SETGET_STACK_FUNCS(super_sectorsize, struct btrfs_super_block,
			 sectorsize, 32);
BTRFS_SETGET_STACK_FUNCS(super_nodesize, struct btrfs_super_block,
			 nodesize, 32);
BTRFS_SETGET_STACK_FUNCS(super_stripesize, struct btrfs_super_block,
			 stripesize, 32);
BTRFS_SETGET_STACK_FUNCS(super_root_dir, struct btrfs_super_block,
			 root_dir_objectid, 64);
BTRFS_SETGET_STACK_FUNCS(super_num_devices, struct btrfs_super_block,
			 num_devices, 64);
BTRFS_SETGET_STACK_FUNCS(super_compat_flags, struct btrfs_super_block,
			 compat_flags, 64);
BTRFS_SETGET_STACK_FUNCS(super_compat_ro_flags, struct btrfs_super_block,
			 compat_ro_flags, 64);
BTRFS_SETGET_STACK_FUNCS(super_incompat_flags, struct btrfs_super_block,
			 incompat_flags, 64);
BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block,
			 csum_type, 16);
BTRFS_SETGET_STACK_FUNCS(super_cache_generation, struct btrfs_super_block,
			 cache_generation, 64);
BTRFS_SETGET_STACK_FUNCS(super_magic, struct btrfs_super_block, magic, 64);
BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
			 uuid_tree_generation, 64);

static inline int btrfs_super_csum_size(struct btrfs_super_block *s)
{
	u16 t = btrfs_super_csum_type(s);
	/*
	 * csum type is validated at mount time
	 */
	return btrfs_csum_sizes[t];
}
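
The lookup above indexes a table without a bounds check because the checksum type is validated once at mount time. A small sketch of that split between a one-time check and an unchecked fast path follows. The table contents (a single CRC32C entry of 4 bytes) and every `demo_` name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed table: index 0 stands for CRC32C with a 4-byte checksum. */
static const int demo_csum_sizes[] = { 4 };
#define DEMO_NUM_CSUM_TYPES \
	(sizeof(demo_csum_sizes) / sizeof(demo_csum_sizes[0]))

/* Mount-time style check: reject types that would index past the table. */
static inline int demo_csum_type_valid(uint16_t type)
{
	return type < DEMO_NUM_CSUM_TYPES;
}

/* Fast-path lookup; only valid after demo_csum_type_valid() has passed. */
static inline int demo_csum_size(uint16_t type)
{
	assert(demo_csum_type_valid(type));
	return demo_csum_sizes[type];
}
```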

static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
{
	return offsetof(struct btrfs_leaf, items);
}

/* struct btrfs_file_extent_item */
BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
			 struct btrfs_file_extent_item, disk_bytenr, 64);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_offset,
			 struct btrfs_file_extent_item, offset, 64);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_generation,
			 struct btrfs_file_extent_item, generation, 64);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_num_bytes,
			 struct btrfs_file_extent_item, num_bytes, 64);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_num_bytes,
			 struct btrfs_file_extent_item, disk_num_bytes, 64);
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_compression,
			 struct btrfs_file_extent_item, compression, 8);

static inline unsigned long
btrfs_file_extent_inline_start(struct btrfs_file_extent_item *e)
{
	return (unsigned long)e + BTRFS_FILE_EXTENT_INLINE_DATA_START;
}

static inline u32 btrfs_file_extent_calc_inline_size(u32 datasize)
{
	return BTRFS_FILE_EXTENT_INLINE_DATA_START + datasize;
}
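
The two helpers above treat an inline extent as a fixed header followed directly by the file data, so the item size is the header length plus the data length. The sketch below assumes the header ends where `disk_bytenr` would begin and that the fields come in the order shown; both the layout and the `demo_` names are assumptions, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field order of the extent item; packed to mirror an on-disk layout. */
struct demo_file_extent_item {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
	uint8_t type;
	uint64_t disk_bytenr;	/* inline data starts at this offset */
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
} __attribute__((packed));

#define DEMO_INLINE_DATA_START \
	offsetof(struct demo_file_extent_item, disk_bytenr)

/* Item size needed to store datasize bytes of inline data. */
static inline uint32_t demo_calc_inline_size(uint32_t datasize)
{
	return (uint32_t)DEMO_INLINE_DATA_START + datasize;
}

/* Inverse: number of inline data bytes held by an item of item_size bytes. */
static inline uint32_t demo_inline_item_len(uint32_t item_size)
{
	assert(item_size >= DEMO_INLINE_DATA_START);
	return item_size - (uint32_t)DEMO_INLINE_DATA_START;
}
```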

BTRFS_SETGET_FUNCS(file_extent_disk_bytenr, struct btrfs_file_extent_item,
		   disk_bytenr, 64);
BTRFS_SETGET_FUNCS(file_extent_generation, struct btrfs_file_extent_item,
		   generation, 64);
BTRFS_SETGET_FUNCS(file_extent_disk_num_bytes, struct btrfs_file_extent_item,
		   disk_num_bytes, 64);
BTRFS_SETGET_FUNCS(file_extent_offset, struct btrfs_file_extent_item,
		   offset, 64);
BTRFS_SETGET_FUNCS(file_extent_num_bytes, struct btrfs_file_extent_item,
		   num_bytes, 64);
Btrfs: Add zlib compression support

This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on-disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on-disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64-sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 01:49:59 +07:00
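
The commit message above names two software limits (128k for the compressed size of an extent, 256k for its uncompressed size) and says that compression is dropped when it does not make the data smaller. The sketch below only illustrates that arithmetic. The function names and the idea of splitting a range into fixed-size chunks are assumptions for the example, not the actual writeback code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_COMPRESSED	(128u * 1024u)	/* limit on the on-disk size */
#define DEMO_MAX_UNCOMPRESSED	(256u * 1024u)	/* limit on the in-ram size */

/* Number of compressed extents needed to cover len bytes of file data. */
static inline uint64_t demo_num_compress_chunks(uint64_t len)
{
	return (len + DEMO_MAX_UNCOMPRESSED - 1) / DEMO_MAX_UNCOMPRESSED;
}

/*
 * Keep a compressed result only if it respects both limits and is
 * strictly smaller than the uncompressed input.
 */
static inline int demo_keep_compressed(uint64_t ram_bytes, uint64_t disk_bytes)
{
	return ram_bytes <= DEMO_MAX_UNCOMPRESSED &&
	       disk_bytes <= DEMO_MAX_COMPRESSED &&
	       disk_bytes < ram_bytes;
}
```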
BTRFS_SETGET_FUNCS(file_extent_ram_bytes, struct btrfs_file_extent_item,
		   ram_bytes, 64);
BTRFS_SETGET_FUNCS(file_extent_compression, struct btrfs_file_extent_item,
		   compression, 8);
BTRFS_SETGET_FUNCS(file_extent_encryption, struct btrfs_file_extent_item,
		   encryption, 8);
BTRFS_SETGET_FUNCS(file_extent_other_encoding, struct btrfs_file_extent_item,
		   other_encoding, 16);

/*
 * this returns the number of bytes used by the item on disk, minus the
 * size of any extent headers.  If a file is compressed on disk, this is
 * the compressed size
 */
static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb,
						    struct btrfs_item *e)
{
	return btrfs_item_size(eb, e) - BTRFS_FILE_EXTENT_INLINE_DATA_START;
}

/* this returns the number of file bytes represented by the inline item.
 * If an item is compressed, this is the uncompressed size
 */
static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
					       int slot,
					       struct btrfs_file_extent_item *fi)
{
	struct btrfs_map_token token;

	btrfs_init_map_token(&token);
	/*
	 * return the space used on disk if this item isn't
	 * compressed or encoded
	 */
	if (btrfs_token_file_extent_compression(eb, fi, &token) == 0 &&
	    btrfs_token_file_extent_encryption(eb, fi, &token) == 0 &&
	    btrfs_token_file_extent_other_encoding(eb, fi, &token) == 0) {
		return btrfs_file_extent_inline_item_len(eb,
							 btrfs_item_nr(slot));
	}

	/* otherwise use the ram bytes field */
	return btrfs_token_file_extent_ram_bytes(eb, fi, &token);
}
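
The function above picks between two sizes: for a plain inline extent the file length equals the bytes stored in the item, while for a compressed or otherwise encoded extent it comes from the `ram_bytes` field. A self-contained sketch of that decision follows. The flattened `struct demo_inline_extent` and the header length of 21 bytes are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INLINE_DATA_START 21u	/* assumed header length in bytes */

/* Hypothetical flattened view of the fields the decision depends on. */
struct demo_inline_extent {
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
	uint64_t ram_bytes;	/* uncompressed length recorded in the item */
	uint32_t item_size;	/* total size of the item on disk */
};

/* Bytes stored on disk after the header (compressed size if compressed). */
static inline uint32_t demo_inline_item_len(const struct demo_inline_extent *e)
{
	assert(e->item_size >= DEMO_INLINE_DATA_START);
	return e->item_size - DEMO_INLINE_DATA_START;
}

/* File bytes represented by the item (uncompressed size if compressed). */
static inline uint32_t demo_inline_len(const struct demo_inline_extent *e)
{
	if (e->compression == 0 && e->encryption == 0 &&
	    e->other_encoding == 0)
		return demo_inline_item_len(e);

	return (uint32_t)e->ram_bytes;
}
```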


/* btrfs_dev_stats_item */
static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb,
					struct btrfs_dev_stats_item *ptr,
					int index)
{
	u64 val;

	read_extent_buffer(eb, &val,
			   offsetof(struct btrfs_dev_stats_item, values) +
			    ((unsigned long)ptr) + (index * sizeof(u64)),
			   sizeof(val));
	return val;
}

static inline void btrfs_set_dev_stats_value(struct extent_buffer *eb,
					     struct btrfs_dev_stats_item *ptr,
					     int index, u64 val)
{
	write_extent_buffer(eb, &val,
			    offsetof(struct btrfs_dev_stats_item, values) +
			     ((unsigned long)ptr) + (index * sizeof(u64)),
			    sizeof(val));
}
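
Both dev stats helpers compute a byte offset from three parts: the position of `values` inside the item, the item's own offset within the buffer (the `ptr` argument cast to an integer), and `index * sizeof(u64)`. The sketch below reproduces that arithmetic over a plain byte array. `struct demo_buffer`, the counter count of 5 and the bounds checks are assumptions added for the example. Like the original, it copies raw bytes without an endianness conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_DEV_STAT_VALUES_MAX 5	/* assumed number of counters */

struct demo_dev_stats_item {
	uint64_t values[DEMO_DEV_STAT_VALUES_MAX];
};

/* Hypothetical stand-in for an extent buffer: just a block of bytes. */
struct demo_buffer {
	unsigned char data[256];
};

/* Byte offset of counter `index` for an item located at item_off. */
static inline size_t demo_stat_offset(unsigned long item_off, int index)
{
	return offsetof(struct demo_dev_stats_item, values) + item_off +
	       (size_t)index * sizeof(uint64_t);
}

static inline uint64_t demo_dev_stats_value(const struct demo_buffer *eb,
					    unsigned long item_off, int index)
{
	uint64_t val;
	size_t off = demo_stat_offset(item_off, index);

	assert(index >= 0 && index < DEMO_DEV_STAT_VALUES_MAX);
	assert(off + sizeof(val) <= sizeof(eb->data));
	memcpy(&val, eb->data + off, sizeof(val));
	return val;
}

static inline void demo_set_dev_stats_value(struct demo_buffer *eb,
					    unsigned long item_off, int index,
					    uint64_t val)
{
	size_t off = demo_stat_offset(item_off, index);

	assert(index >= 0 && index < DEMO_DEV_STAT_VALUES_MAX);
	assert(off + sizeof(val) <= sizeof(eb->data));
	memcpy(eb->data + off, &val, sizeof(val));
}
```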

/* btrfs_qgroup_status_item */
BTRFS_SETGET_FUNCS(qgroup_status_generation, struct btrfs_qgroup_status_item,
		   generation, 64);
BTRFS_SETGET_FUNCS(qgroup_status_version, struct btrfs_qgroup_status_item,
		   version, 64);
BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
		   flags, 64);
BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
		   rescan, 64);

/* btrfs_qgroup_info_item */
BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
		   generation, 64);
BTRFS_SETGET_FUNCS(qgroup_info_rfer, struct btrfs_qgroup_info_item, rfer, 64);
BTRFS_SETGET_FUNCS(qgroup_info_rfer_cmpr, struct btrfs_qgroup_info_item,
		   rfer_cmpr, 64);
BTRFS_SETGET_FUNCS(qgroup_info_excl, struct btrfs_qgroup_info_item, excl, 64);
BTRFS_SETGET_FUNCS(qgroup_info_excl_cmpr, struct btrfs_qgroup_info_item,
		   excl_cmpr, 64);

BTRFS_SETGET_STACK_FUNCS(stack_qgroup_info_generation,
			 struct btrfs_qgroup_info_item, generation, 64);
BTRFS_SETGET_STACK_FUNCS(stack_qgroup_info_rfer, struct btrfs_qgroup_info_item,
			 rfer, 64);
BTRFS_SETGET_STACK_FUNCS(stack_qgroup_info_rfer_cmpr,
			 struct btrfs_qgroup_info_item, rfer_cmpr, 64);
BTRFS_SETGET_STACK_FUNCS(stack_qgroup_info_excl, struct btrfs_qgroup_info_item,
			 excl, 64);
BTRFS_SETGET_STACK_FUNCS(stack_qgroup_info_excl_cmpr,
			 struct btrfs_qgroup_info_item, excl_cmpr, 64);

/* btrfs_qgroup_limit_item */
BTRFS_SETGET_FUNCS(qgroup_limit_flags, struct btrfs_qgroup_limit_item,
		   flags, 64);
BTRFS_SETGET_FUNCS(qgroup_limit_max_rfer, struct btrfs_qgroup_limit_item,
		   max_rfer, 64);
BTRFS_SETGET_FUNCS(qgroup_limit_max_excl, struct btrfs_qgroup_limit_item,
		   max_excl, 64);
BTRFS_SETGET_FUNCS(qgroup_limit_rsv_rfer, struct btrfs_qgroup_limit_item,
		   rsv_rfer, 64);
BTRFS_SETGET_FUNCS(qgroup_limit_rsv_excl, struct btrfs_qgroup_limit_item,
		   rsv_excl, 64);

/* btrfs_dev_replace_item */
BTRFS_SETGET_FUNCS(dev_replace_src_devid,
		   struct btrfs_dev_replace_item, src_devid, 64);
BTRFS_SETGET_FUNCS(dev_replace_cont_reading_from_srcdev_mode,
		   struct btrfs_dev_replace_item, cont_reading_from_srcdev_mode,
		   64);
BTRFS_SETGET_FUNCS(dev_replace_replace_state, struct btrfs_dev_replace_item,
		   replace_state, 64);
BTRFS_SETGET_FUNCS(dev_replace_time_started, struct btrfs_dev_replace_item,
		   time_started, 64);
BTRFS_SETGET_FUNCS(dev_replace_time_stopped, struct btrfs_dev_replace_item,
		   time_stopped, 64);
BTRFS_SETGET_FUNCS(dev_replace_num_write_errors, struct btrfs_dev_replace_item,
		   num_write_errors, 64);
BTRFS_SETGET_FUNCS(dev_replace_num_uncorrectable_read_errors,
		   struct btrfs_dev_replace_item, num_uncorrectable_read_errors,
		   64);
BTRFS_SETGET_FUNCS(dev_replace_cursor_left, struct btrfs_dev_replace_item,
		   cursor_left, 64);
BTRFS_SETGET_FUNCS(dev_replace_cursor_right, struct btrfs_dev_replace_item,
		   cursor_right, 64);

BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_src_devid,
			 struct btrfs_dev_replace_item, src_devid, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cont_reading_from_srcdev_mode,
			 struct btrfs_dev_replace_item,
			 cont_reading_from_srcdev_mode, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_replace_state,
			 struct btrfs_dev_replace_item, replace_state, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_time_started,
			 struct btrfs_dev_replace_item, time_started, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_time_stopped,
			 struct btrfs_dev_replace_item, time_stopped, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_num_write_errors,
			 struct btrfs_dev_replace_item, num_write_errors, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_num_uncorrectable_read_errors,
			 struct btrfs_dev_replace_item,
			 num_uncorrectable_read_errors, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_left,
			 struct btrfs_dev_replace_item, cursor_left, 64);
BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_right,
			 struct btrfs_dev_replace_item, cursor_right, 64);

static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
{
	return sb->s_fs_info;
}

/* helper function to cast into the data area of the leaf. */
#define btrfs_item_ptr(leaf, slot, type) \
	((type *)(btrfs_leaf_data(leaf) + \
	btrfs_item_offset_nr(leaf, slot)))

#define btrfs_item_ptr_offset(leaf, slot) \
	((unsigned long)(btrfs_leaf_data(leaf) + \
	btrfs_item_offset_nr(leaf, slot)))
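
The two macros above locate an item's payload by adding the item's recorded offset to the start of the leaf's data area. The sketch below shows the same addition over an ordinary in-memory block. It is an illustration under assumptions: the header and block sizes are made up, item offsets are passed in as a separate array, and `demo_item_ptr` yields a real pointer, whereas the value produced by the macro above is used as an offset with the extent buffer accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADER_SIZE 16	/* hypothetical header size */
#define DEMO_LEAF_SIZE   256	/* hypothetical block size */

struct demo_leaf {
	unsigned char header[DEMO_HEADER_SIZE];
	unsigned char items[DEMO_LEAF_SIZE - DEMO_HEADER_SIZE];
};

/* Offset of the data area from the start of the leaf. */
static inline unsigned long demo_leaf_data(void)
{
	return offsetof(struct demo_leaf, items);
}

/* Offset of slot's payload from the start of the leaf. */
static inline unsigned long demo_item_ptr_offset(const uint32_t *item_offsets,
						 int slot)
{
	unsigned long off = demo_leaf_data() + item_offsets[slot];

	assert(off < DEMO_LEAF_SIZE);
	return off;
}

/* Typed pointer to slot's payload inside an in-memory leaf. */
#define demo_item_ptr(leaf, item_offsets, slot, type) \
	((type *)((unsigned char *)(leaf) + \
		  demo_item_ptr_offset(item_offsets, slot)))
```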

static inline bool btrfs_mixed_space_info(struct btrfs_space_info *space_info)
{
	return ((space_info->flags & BTRFS_BLOCK_GROUP_METADATA) &&
		(space_info->flags & BTRFS_BLOCK_GROUP_DATA));
}
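
The helper above reports a "mixed" space info only when both the data and the metadata bits are set. A tiny standalone sketch of that flag test follows. The bit positions chosen for `DEMO_BLOCK_GROUP_DATA` and `DEMO_BLOCK_GROUP_METADATA` are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BLOCK_GROUP_DATA		(1ULL << 0)	/* assumed bit */
#define DEMO_BLOCK_GROUP_METADATA	(1ULL << 2)	/* assumed bit */

/* True only when the flags carry both the data and the metadata bit. */
static inline bool demo_mixed_space_info(uint64_t flags)
{
	return (flags & DEMO_BLOCK_GROUP_METADATA) &&
	       (flags & DEMO_BLOCK_GROUP_DATA);
}
```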

static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
{
	return mapping_gfp_constraint(mapping, ~__GFP_FS);
}

/* extent-tree.c */

u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes);

btrfs: implement delayed inode items operation

Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which was reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which was reported by Itaru Kitayama, by updating the space
  cache inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.

Changelog V1 -> V2:
- Break up the global rb-tree and use a list to manage the delayed nodes,
  which are created for every directory and file and are used to manage the
  delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.

Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. The
  other is used to manage the delayed nodes which are waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
  index items which are going to be inserted into the b+ tree, and the other is
  used to manage the directory name index items which are going to be deleted
  from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker handles
  the delayed directory name index item insertions and deletions and the
  delayed inode updates.
  When the number of delayed items exceeds the lower limit, we create works for
  some delayed nodes, insert them into the work queue of the worker, and then
  go back.
  When the number of delayed items exceeds the upper bound, we create works for
  all the delayed nodes that haven't been dealt with, insert them into the work
  queue of the worker, and then wait until the number of untreated items falls
  below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
  the information into the delayed inserting rb-tree.
  Then we check the number of delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
  it in the inserting rb-tree first. If we find it there, we just drop it. If
  not, we add its key to the delayed deleting rb-tree.
  As with the delayed inserting rb-tree, we also check the number of
  delayed items and do delayed items balance.
  (The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
  inode in the delayed node. The worker will flush it into the b+ tree after
  dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
  delayed node. This way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
	Total files: 50000
	Total time: 1.096108
	Average time: 0.000022
Delete files:
	Total files: 50000
	Total time: 1.510403
	Average time: 0.000030

After applying this patch:
Create files:
	Total files: 50000
	Total time: 0.932899
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.215732
	Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 17:12:22 +07:00
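
The improvement figures quoted in the commit message above (~15% for creation, ~20% for deletion) follow from the listed total times. The short sketch below reproduces that arithmetic. The helper names are hypothetical and the only inputs are the totals and file count from the message.

```c
#include <assert.h>

/* Relative reduction in total time, as a percentage of the "before" time. */
static inline double demo_improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Average time per file for a run over nfiles files. */
static inline double demo_average_time(double total, unsigned long nfiles)
{
	assert(nfiles > 0);
	return total / (double)nfiles;
}
```

With the totals from the message, `demo_improvement_pct(1.096108, 0.932899)` is about 14.9 and `demo_improvement_pct(1.510403, 1.215732)` is about 19.5, which matches the rounded claims.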
static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
						 unsigned num_items)
{
	return (root->nodesize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
		2 * num_items;
}

/*
 * Doing a truncate won't result in new nodes or leaves, just what we need for
 * COW.
 */
static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_root *root,
						 unsigned num_items)
{
	return root->nodesize * BTRFS_MAX_LEVEL * num_items;
}
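
The two reservation helpers above differ by a factor of two: the transaction estimate reserves one block per tree level, doubled, while the truncate estimate reserves one block per level only, since truncation needs space for COW but creates no new nodes or leaves. A worked sketch follows. It takes the node size as a plain argument and assumes `BTRFS_MAX_LEVEL` is 8; both that value and the `demo_` names are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_LEVEL 8	/* assumed value of BTRFS_MAX_LEVEL */

/* One leaf plus (levels - 1) nodes, doubled, per item. */
static inline uint64_t demo_calc_trans_metadata_size(uint32_t nodesize,
						     unsigned num_items)
{
	return ((uint64_t)nodesize +
		(uint64_t)nodesize * (DEMO_MAX_LEVEL - 1)) * 2 * num_items;
}

/* One block per level per item; only what COW needs. */
static inline uint64_t demo_calc_trunc_metadata_size(uint32_t nodesize,
						     unsigned num_items)
{
	return (uint64_t)nodesize * DEMO_MAX_LEVEL * num_items;
}
```

For a 16 KiB node size this gives 262144 bytes per item for the transaction estimate and 131072 bytes per item for the truncate estimate.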

int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
				       struct btrfs_root *root);
int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
				       struct btrfs_root *root);
void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
			   struct btrfs_root *root, unsigned long count);
int btrfs_async_run_delayed_refs(struct btrfs_root *root,
				 unsigned long count, int wait);
int btrfs_lookup_data_extent(struct btrfs_root *root, u64 start, u64 len);
int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
			     struct btrfs_root *root, u64 bytenr,
			     u64 offset, int metadata, u64 *refs, u64 *flags);
int btrfs_pin_extent(struct btrfs_root *root,
		     u64 bytenr, u64 num, int reserved);
int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
				    u64 bytenr, u64 num_bytes);
int btrfs_exclude_logged_extents(struct btrfs_root *root,
				 struct extent_buffer *eb);
int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
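The reference-count rule for COW and the choice between the two back ref kinds described in this message can be sketched as follows. This is a toy model under stated assumptions, not the kernel code: `struct block`, `cow_block`, `pick_backref` and the enum names are hypothetical, and real tree blocks, extents and back ref items are far richer than a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the two back ref kinds described in the message. */
enum backref_kind {
	BACKREF_KEYED,	/* "fuzzy": key, level and owning tree */
	BACKREF_FULL	/* records the location of the parent block */
};

/* Keyed back refs while a single root references the block, full otherwise. */
static enum backref_kind pick_backref(unsigned int nr_roots)
{
	return nr_roots > 1 ? BACKREF_FULL : BACKREF_KEYED;
}

/* Toy model of a tree block or extent: just a reference count. */
struct block {
	unsigned int refs;
	int freed;
};

/*
 * COW 'old' into 'new_blk'. 'children' are the extents the block points to.
 * Non-shared block: free the old block at once, the new block inherits its
 * references, so the children are left untouched.
 * Shared block: every child gains one reference (from the new block) and
 * the old block loses one.
 */
static void cow_block(struct block *old, struct block *new_blk,
		      struct block *children, size_t nr_children)
{
	size_t i;

	assert(old->refs >= 1);
	new_blk->refs = 1;
	new_blk->freed = 0;

	if (old->refs == 1) {
		old->refs = 0;
		old->freed = 1;
		return;
	}
	for (i = 0; i < nr_children; i++)
		children[i].refs++;
	old->refs--;
}
```

In the non-shared case no child counter is modified, which is the saving the message attributes to dead tree avoidance.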
2009-06-10 21:45:14 +07:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr);
|
2009-01-06 09:25:51 +07:00
|
|
|
struct btrfs_block_group_cache *btrfs_lookup_block_group(
|
|
|
|
struct btrfs_fs_info *info,
|
|
|
|
u64 bytenr);
|
2009-06-10 21:45:14 +07:00
|
|
|
void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
|
2013-11-02 00:07:04 +07:00
|
|
|
int get_block_group_index(struct btrfs_block_group_cache *cache);
|
2014-06-15 06:54:12 +07:00
|
|
|
struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 parent,
|
|
|
|
u64 root_objectid,
|
2009-06-10 21:45:14 +07:00
|
|
|
struct btrfs_disk_key *key, int level,
|
2012-05-16 22:04:52 +07:00
|
|
|
u64 hint, u64 empty_size);
|
2010-05-16 21:46:25 +07:00
|
|
|
void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
2012-05-16 22:04:52 +07:00
|
|
|
u64 parent, int last_ref);
|
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner,
|
2015-10-26 13:11:18 +07:00
|
|
|
u64 offset, u64 ram_bytes,
|
|
|
|
struct btrfs_key *ins);
|
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins);
|
2013-08-15 01:02:47 +07:00
|
|
|
int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
|
|
|
|
u64 min_alloc_size, u64 empty_size, u64 hint_byte,
|
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
This is because we didn't update the metadata of the allocated space (in the
extent tree) until the file data was written to disk. During this time, there
was no information about the allocated space in either the extent tree or the
free space cache. When we wrote out the free space cache at this time (commit
transaction), that space was lost. In fact, only the free space that is
used to store the file data had this problem; the rest didn't because
its metadata is updated in the same transaction context.
There are several methods which can fix the above problem:
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data; if the size is not zero, don't write out the free space cache.
The first one is complex and may hurt performance.
This patch chooses the second method: we use a per-block-group variable to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
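The second method can be sketched as a per-block-group byte counter that gates the cache write-out. This is a minimal model, assuming hypothetical names (`block_group`, `pending_data_bytes` and the three helpers); the read-write semaphore that the patch adds to close the race between allocation and write-out is deliberately left out of the sketch.

```c
#include <assert.h>

/* Hypothetical per-block-group accounting state. */
struct block_group {
	/* Data space handed out whose extent-tree metadata is not yet updated. */
	unsigned long long pending_data_bytes;
};

/* Called when space is allocated for file data. */
static void account_data_alloc(struct block_group *bg, unsigned long long bytes)
{
	bg->pending_data_bytes += bytes;
}

/* Called once the extent tree has been updated for that space. */
static void account_metadata_updated(struct block_group *bg,
				     unsigned long long bytes)
{
	assert(bg->pending_data_bytes >= bytes);
	bg->pending_data_bytes -= bytes;
}

/*
 * The free space cache may only be written out when no allocated data
 * space is missing from both the extent tree and the cache.
 */
static int can_write_free_space_cache(const struct block_group *bg)
{
	return bg->pending_data_bytes == 0;
}
```

Skipping the write-out costs a cache rebuild for that block group on the next mount, but it never records space as free while file data is still in flight to it.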
2014-06-19 09:42:50 +07:00
|
|
|
struct btrfs_key *ins, int is_data, int delalloc);
|
2007-03-17 03:20:31 +07:00
|
|
|
int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2014-07-03 00:54:25 +07:00
|
|
|
struct extent_buffer *buf, int full_backref);
|
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2014-07-03 00:54:25 +07:00
|
|
|
struct extent_buffer *buf, int full_backref);
|
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 flags,
|
2013-05-10 00:49:30 +07:00
|
|
|
int level, int is_data);
|
2008-09-24 00:14:14 +07:00
|
|
|
int btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2011-09-12 20:26:38 +07:00
|
|
|
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
|
Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn causes the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens on a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->ref_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debug this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes a deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
The problem goes away by eliminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
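The nine-step scenario in this message can be replayed with a small simulation. It is a simplified model, not the kernel's merge code: the types and helpers (`delayed_ref`, `try_merge`, `add_ref`, `merge_all`, `run_scenario`) are hypothetical, refs are kept in a fixed array, and it assumes that an ADD and a DROP that are allowed to merge cancel each other out.

```c
#include <assert.h>
#include <string.h>

enum ref_action { ADD_REF, DROP_REF };

/* Hypothetical, stripped-down delayed reference for one bytenr. */
struct delayed_ref {
	enum ref_action action;
	int no_quota;
	int ref_mod;
};

#define MAX_REFS 8

struct ref_list {
	struct delayed_ref refs[MAX_REFS];
	int nr;
};

static void remove_at(struct ref_list *l, int idx)
{
	memmove(&l->refs[idx], &l->refs[idx + 1],
		(size_t)(l->nr - idx - 1) * sizeof(l->refs[0]));
	l->nr--;
}

/* Fold src into dst. With check_no_quota set, differing flags block it. */
static int try_merge(struct delayed_ref *dst, const struct delayed_ref *src,
		     int check_no_quota)
{
	if (check_no_quota && dst->no_quota != src->no_quota)
		return 0;
	if (dst->action == src->action) {
		dst->ref_mod += src->ref_mod;
	} else if (dst->ref_mod >= src->ref_mod) {
		dst->ref_mod -= src->ref_mod;	/* opposite actions cancel */
	} else {
		dst->action = src->action;
		dst->ref_mod = src->ref_mod - dst->ref_mod;
	}
	return 1;
}

/* Steps 1-4: a new ref is only compared with the ref at the tail. */
static void add_ref(struct ref_list *l, enum ref_action action, int no_quota,
		    int check_no_quota)
{
	struct delayed_ref ref = { action, no_quota, 1 };

	if (l->nr > 0 && try_merge(&l->refs[l->nr - 1], &ref, check_no_quota)) {
		if (l->refs[l->nr - 1].ref_mod == 0)
			l->nr--;
		return;
	}
	assert(l->nr < MAX_REFS);
	l->refs[l->nr++] = ref;
}

/* Steps 5-8: merge every compatible pair before running the refs. */
static void merge_all(struct ref_list *l, int check_no_quota)
{
	for (int i = 0; i < l->nr; i++) {
		for (int j = i + 1; j < l->nr; ) {
			if (try_merge(&l->refs[i], &l->refs[j], check_no_quota))
				remove_at(l, j);
			else
				j++;
		}
		if (l->refs[i].ref_mod == 0) {
			remove_at(l, i);
			i--;
		}
	}
}

/* Step 9: ref_mod of the first ADD ref picked to run, 0 if none is left. */
static int run_scenario(int check_no_quota)
{
	struct ref_list l = { .nr = 0 };

	add_ref(&l, ADD_REF, 1, check_no_quota);	/* Ref1 */
	add_ref(&l, DROP_REF, 0, check_no_quota);	/* Ref2 */
	add_ref(&l, ADD_REF, 1, check_no_quota);	/* Ref3 */
	add_ref(&l, DROP_REF, 0, check_no_quota);	/* Ref4 */
	merge_all(&l, check_no_quota);
	for (int i = 0; i < l.nr; i++)
		if (l.refs[i].action == ADD_REF)
			return l.refs[i].ref_mod;
	return 0;
}
```

In this model `run_scenario(1)` ends with an ADD ref whose `ref_mod` is 2, the value that trips the `BUG_ON`, while `run_scenario(0)` lets each DROP cancel the ADD at the tail so that no ref with `ref_mod != 1` is left to run.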
2015-10-23 13:52:54 +07:00
|
|
|
u64 owner, u64 offset);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
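As a rough illustration of the two COW cases described above, here is a minimal C sketch. The names toy_block and toy_cow() are hypothetical and invented for this example; they are not the real btrfs structures or functions, and the extents a block points to are modeled as a plain array of counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model, not a btrfs structure. */
struct toy_block {
	uint64_t refs;        /* reference count of this tree block */
	uint64_t *child_refs; /* reference counts of the extents it points to */
	int nr_children;
};

/*
 * COW 'old' into 'cow'. Returns true when the old block could be freed
 * at once (the non-shared case).
 */
static bool toy_cow(struct toy_block *old, struct toy_block *cow)
{
	int i;

	assert(old->refs > 0);

	cow->refs = 1;
	cow->child_refs = old->child_refs;
	cow->nr_children = old->nr_children;

	if (old->refs == 1) {
		/*
		 * Non-shared: the new block inherits the old block's
		 * references, so no child counter changes.
		 */
		old->refs = 0;
		return true;
	}

	/*
	 * Shared: every extent gains a reference from the new block and
	 * the old block loses one reference.
	 */
	for (i = 0; i < old->nr_children; i++)
		old->child_refs[i]++;
	old->refs--;
	return false;
}
```

In this model only the shared case touches the counters of lower level extents, which is the saving the commit message describes.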
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level and the
tree the pointer lives in. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
|
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
This is because we didn't update the metadata of the allocated space (in the
extent tree) until the file data was written to disk. During this time, there
was no information about the allocated space in either the extent tree or the
free space cache. When we wrote out the free space cache at this time (at
transaction commit), that space was lost. In fact, only the free space that is
used to store file data had this problem; other allocations didn't, because
their metadata is updated in the same transaction context.
There are several ways to fix the above problem:
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data; if the size is not zero, don't write out the free space cache.
The first one is complex and may hurt performance.
This patch chooses the second method: we use a per-block-group variable to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
allocation and the free space cache write out.
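The second method can be sketched roughly as follows. This is a simplified model under stated assumptions: toy_block_group, its delalloc_bytes counter and the helper names are hypothetical, and the read-write semaphore that guards against the race is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block-group state, not the real btrfs structure. */
struct toy_block_group {
	/* Bytes allocated for file data whose metadata is not updated yet. */
	uint64_t delalloc_bytes;
};

/* Space was allocated for file data; the extent tree does not know yet. */
static void toy_alloc_for_data(struct toy_block_group *bg, uint64_t len)
{
	bg->delalloc_bytes += len;
}

/* The data was written and the extent tree now records the allocation. */
static void toy_data_metadata_updated(struct toy_block_group *bg, uint64_t len)
{
	assert(bg->delalloc_bytes >= len);
	bg->delalloc_bytes -= len;
}

/* Only write out the free space cache when nothing is unaccounted for. */
static bool toy_can_write_free_space_cache(const struct toy_block_group *bg)
{
	return bg->delalloc_bytes == 0;
}
```

Skipping the cache write while the counter is non-zero trades a cache rebuild on the next mount for a cache that never disagrees with the extent tree.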
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 09:42:50 +07:00
|
|
|
int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len,
|
|
|
|
int delalloc);
|
2011-11-01 07:52:39 +07:00
|
|
|
int btrfs_free_and_pin_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len);
|
2012-03-01 20:56:26 +07:00
|
|
|
void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2007-06-29 02:57:36 +07:00
|
|
|
int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
|
2009-09-12 03:11:19 +07:00
|
|
|
struct btrfs_root *root);
|
2007-04-18 00:26:50 +07:00
|
|
|
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
2008-09-24 00:14:14 +07:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn causes the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
This happens in a scenario like the following:
1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->ref_mod != 1 (equals 2).
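The scenario above can be reproduced with a small toy model. This is only a sketch under simplifying assumptions: toy_ref and toy_max_ref_mod() are hypothetical names, references are considered mergeable purely on the basis of no_quota, and an add merged with a drop is assumed to cancel out. It does not model the real delayed reference lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for BTRFS_ADD_DELAYED_REF / BTRFS_DROP_DELAYED_REF. */
enum toy_action { TOY_ADD = 1, TOY_DROP = -1 };

struct toy_ref {
	enum toy_action action;
	int no_quota;
};

/*
 * Merge all references for one bytenr and return the largest resulting
 * ref_mod. When honor_no_quota is true, only references with the same
 * no_quota value are merged with each other.
 */
static int toy_max_ref_mod(const struct toy_ref *refs, int n,
			   bool honor_no_quota)
{
	int net[2] = { 0, 0 };
	int i, a, b;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		int cls = honor_no_quota ? (refs[i].no_quota != 0) : 0;

		net[cls] += (int)refs[i].action;
	}
	a = abs(net[0]);
	b = abs(net[1]);
	return a > b ? a : b;
}
```

Fed with Ref1..Ref4 from the list above, the model yields a ref_mod of 2 while no_quota is honored, which is the value that trips BUG_ON(node->ref_mod != 1), and 0 once the field is ignored.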
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debug this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes a deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst on the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)
The problem goes away by eliminating check_ref(), which is no longer
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-23 13:52:54 +07:00
|
|
|
u64 root_objectid, u64 owner, u64 offset);
|
2009-06-10 21:45:14 +07:00
|
|
|
|
2015-04-07 02:46:08 +07:00
|
|
|
int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2007-04-27 03:46:15 +07:00
|
|
|
int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2015-03-03 04:37:31 +07:00
|
|
|
int btrfs_setup_space_cache(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2008-12-12 04:30:39 +07:00
|
|
|
int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr);
|
2007-04-27 03:46:15 +07:00
|
|
|
int btrfs_free_block_groups(struct btrfs_fs_info *info);
|
|
|
|
int btrfs_read_block_groups(struct btrfs_root *root);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are movable, for example when we have a lot of empty metadata block
groups, and that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 03:11:19 +07:00
|
|
|
int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr);
|
2008-03-25 02:01:56 +07:00
|
|
|
int btrfs_make_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytes_used,
|
2008-04-16 02:41:47 +07:00
|
|
|
u64 type, u64 chunk_objectid, u64 chunk_offset,
|
2008-03-25 02:01:56 +07:00
|
|
|
u64 size);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 21:09:34 +07:00
|
|
|
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
|
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or join an existing transaction) consists of visiting all block groups
and then for each one iterating its free space entries and performing a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-28 04:14:15 +07:00
|
|
|
struct btrfs_root *root, u64 group_start,
|
|
|
|
struct extent_map *em);
|
2014-09-18 22:20:02 +07:00
|
|
|
void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
|
2015-06-15 20:41:19 +07:00
|
|
|
void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
|
|
|
|
void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
|
2012-09-12 03:57:25 +07:00
|
|
|
void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
btrfs: fix wrong free space information of btrfs
When we store data with a raid profile in btrfs on two or more disks of
different sizes, the df command shows there is some free space in the
filesystem, but the user cannot in fact write any data; df shows the wrong
free space information for btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
This is because btrfs cannot allocate chunks when one of the paired disks has
no space left. The free space on the other disks can never be used and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total, which is confusing to the user.
This patch fixes it by calculating the free space that can be used to allocate
chunks.
Implementation:
1. get the free space of all the devices, and align it by stripe length.
2. sort the devices by free space.
3. check the free space of the devices,
3.1. if it is not zero, check the number of devices that have
more free space than this device,
if the number of devices is beyond the min stripe number, the free
space can be used, and is added to the total free space.
if the number of devices is below the min stripe number, we cannot
use the free space, and the check ends.
3.2. if the free space is zero, check the next device, goto 3.1
This implementation is just like a fake chunk allocation.
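The steps above can be sketched as a small stand-alone C function. This is a simplified illustration, not the code added by the patch: toy_calc_avail() and its parameters are hypothetical, every fake chunk is assumed to use exactly min_stripes devices, a device is counted among its own peers, and the result is the raw space (summed over all stripes) that could still be allocated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sort helper: largest free space first. */
static int toy_cmp_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * free_bytes[] holds the free space of each device and is modified in
 * place. Returns the raw space usable for new chunks.
 */
static uint64_t toy_calc_avail(uint64_t *free_bytes, int nr,
			       int min_stripes, uint64_t stripe_len)
{
	uint64_t total = 0;
	int i, j;

	assert(stripe_len > 0 && min_stripes > 0);

	/* Step 1: align each device's free space down to the stripe length. */
	for (i = 0; i < nr; i++)
		free_bytes[i] -= free_bytes[i] % stripe_len;

	/* Step 2: sort the devices by free space. */
	qsort(free_bytes, nr, sizeof(*free_bytes), toy_cmp_desc);

	/* Step 3: walk from the device with the least free space upwards. */
	for (i = nr - 1; i >= 0; i--) {
		uint64_t f = free_bytes[i];

		if (f == 0)
			continue;	/* step 3.2 */
		if (i + 1 < min_stripes)
			break;		/* too few devices left, check ends */

		/* Fake allocation: consume f bytes on min_stripes devices. */
		total += f * (uint64_t)min_stripes;
		for (j = i + 1 - min_stripes; j <= i; j++)
			free_bytes[j] -= f;
	}
	return total;
}
```

For the raid1 example in this message, one device has no free space and the other has about 5GB, so the loop stops at the second device and reports 0, matching the corrected df output.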
After applying this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 17:07:31 +07:00
|
|
|
u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
|
2009-03-10 23:39:20 +07:00
|
|
|
void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
|
Btrfs: improve the noflush reservation
In some places (such as evicting an inode), we just cannot flush the reserved
space of delalloc. Flushing the delayed directory index and delayed inode
is OK there, but we don't try to flush those things and just give up when there
is not enough space to be reserved. This patch fixes this problem.
We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are in a transaction, we should not flush anything, or a deadlock
would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we flush everything.
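The choice between the three flush types can be summarized in a few lines of C. This is a hedged sketch only: the toy_flush enum and toy_pick_flush() are hypothetical, and real callers pick the mode statically at each call site rather than through a helper like this.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three flush types named above. */
enum toy_flush {
	TOY_NO_FLUSH,
	TOY_FLUSH_LIMIT,
	TOY_FLUSH_ALL,
};

static enum toy_flush toy_pick_flush(bool in_transaction,
				     bool delalloc_flush_may_deadlock)
{
	/* Inside a transaction nothing may be flushed. */
	if (in_transaction)
		return TOY_NO_FLUSH;
	/* Skip delalloc, but delayed items may still be flushed. */
	if (delalloc_flush_may_deadlock)
		return TOY_FLUSH_LIMIT;
	/* No restrictions apply. */
	return TOY_FLUSH_ALL;
}
```

Inode eviction is the example given above for the middle case: delalloc must be left alone, yet flushing delayed directory indexes and delayed inodes is still safe.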
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 18:33:38 +07:00
|
|
|
|
|
|
|
enum btrfs_reserve_flush_enum {
|
|
|
|
/* If we are in the transaction, we can't flush anything.*/
|
|
|
|
BTRFS_RESERVE_NO_FLUSH,
|
|
|
|
/*
|
|
|
|
* Flushing delalloc may cause deadlock somewhere, in this
|
|
|
|
* case, use FLUSH LIMIT
|
|
|
|
*/
|
|
|
|
BTRFS_RESERVE_FLUSH_LIMIT,
|
|
|
|
BTRFS_RESERVE_FLUSH_ALL,
|
|
|
|
};
|
|
|
|
|
2015-09-08 16:25:55 +07:00
|
|
|
int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
|
2015-09-08 16:22:42 +07:00
|
|
|
int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
|
2015-09-08 16:25:55 +07:00
|
|
|
void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
|
2015-10-08 17:19:37 +07:00
|
|
|
void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
|
|
|
|
u64 len);
|
2010-05-16 21:48:46 +07:00
|
|
|
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
Btrfs: fix -ENOSPC when finishing block group creation
While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:
[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.
So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.
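The N versus M argument can be checked with a toy simulation. Everything here is an assumption made for illustration: toy_enospc_count() is a hypothetical helper, each task needs exactly one unit of system space, and a new system chunk is assumed to provide TOY_CHUNK_CAPACITY units.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented figure: units of system space provided by one new system chunk. */
#define TOY_CHUNK_CAPACITY 4

/*
 * Simulate n_tasks creating block groups in parallel while the system
 * space_info has room for 'capacity' chunk tree updates. Returns how
 * many tasks hit ENOSPC in phase 2.
 */
static int toy_enospc_count(int n_tasks, int capacity, bool reserve_in_phase1)
{
	int reserved = 0, used = 0, failures = 0;
	int i;

	assert(n_tasks >= 0 && capacity >= 0);

	/* Phase 1: check for space, allocating a system chunk if needed. */
	for (i = 0; i < n_tasks; i++) {
		if (capacity - reserved < 1)
			capacity += TOY_CHUNK_CAPACITY;
		if (reserve_in_phase1)
			reserved++;
	}

	/* Phase 2: every task updates the chunk tree. */
	for (i = 0; i < n_tasks; i++) {
		if (used < capacity)
			used++;
		else
			failures++;
	}
	return failures;
}
```

Without a reservation every task sees the same free space in phase 1, so with N = 5 and M = 2 the model reports N - M = 3 failures; with the reservation the third task notices the shortage and a new system chunk is allocated in time.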
Fix this by doing proper reservation in the chunk block reserve.
The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-20 20:01:54 +07:00
|
|
|
void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
|
2010-05-16 21:49:58 +07:00
|
|
|
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode);
|
|
|
|
void btrfs_orphan_release_metadata(struct inode *inode);
|
2013-02-28 17:04:33 +07:00
|
|
|
int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv,
|
|
|
|
int nitems,
|
2013-07-10 03:37:21 +07:00
|
|
|
u64 *qgroup_reserved, bool use_global_rsv);
|
2013-02-28 17:04:33 +07:00
|
|
|
void btrfs_subvolume_release_metadata(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv,
|
|
|
|
u64 qgroup_reserved);
|
2010-05-16 21:48:47 +07:00
|
|
|
int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
|
|
|
|
void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
|
2015-09-08 16:25:55 +07:00
|
|
|
int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
|
|
|
|
void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
|
2012-09-06 17:02:28 +07:00
|
|
|
void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
|
|
|
|
struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
|
|
|
|
unsigned short type);
|
2010-05-16 21:46:25 +07:00
|
|
|
void btrfs_free_block_rsv(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv);
|
2015-04-07 08:17:00 +07:00
|
|
|
void __btrfs_free_block_rsv(struct btrfs_block_rsv *rsv);
|
2011-08-30 23:34:28 +07:00
|
|
|
int btrfs_block_rsv_add(struct btrfs_root *root,
|
2012-10-16 18:33:38 +07:00
|
|
|
struct btrfs_block_rsv *block_rsv, u64 num_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush);
|
2011-08-30 23:34:28 +07:00
|
|
|
int btrfs_block_rsv_check(struct btrfs_root *root,
|
2011-10-18 23:15:48 +07:00
|
|
|
struct btrfs_block_rsv *block_rsv, int min_factor);
|
|
|
|
int btrfs_block_rsv_refill(struct btrfs_root *root,
|
2012-10-16 18:33:38 +07:00
|
|
|
struct btrfs_block_rsv *block_rsv, u64 min_reserved,
|
|
|
|
enum btrfs_reserve_flush_enum flush);
|
2010-05-16 21:46:25 +07:00
|
|
|
int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
|
|
|
|
struct btrfs_block_rsv *dst_rsv,
|
|
|
|
u64 num_bytes);
|
2013-05-30 01:54:47 +07:00
|
|
|
int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_block_rsv *dest, u64 num_bytes,
|
|
|
|
int min_factor);
|
2010-05-16 21:46:25 +07:00
|
|
|
void btrfs_block_rsv_release(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes);
|
2015-08-05 15:43:27 +07:00
|
|
|
int btrfs_inc_block_group_ro(struct btrfs_root *root,
|
2010-05-16 21:46:25 +07:00
|
|
|
struct btrfs_block_group_cache *cache);
|
2015-08-05 15:43:27 +07:00
|
|
|
void btrfs_dec_block_group_ro(struct btrfs_root *root,
|
2012-03-01 20:56:26 +07:00
|
|
|
struct btrfs_block_group_cache *cache);
|
2010-06-22 01:48:16 +07:00
|
|
|
void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
|
btrfs: fix wrong free space information of btrfs
When we store data with a raid profile in btrfs on two or more disks of
different sizes, the df command shows some free space in the filesystem, but the
user cannot actually write any data: the df command shows the wrong free space
information for btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
This is because btrfs cannot allocate chunks when one of the paired disks has
no space. The free space on the other disks can never be used and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. This is confusing for the user.
This patch fixes it by calculating the free space that can be used to allocate
chunks.
Implementation:
1. get the free space of all the devices, and align it to the stripe length.
2. sort the devices by free space.
3. check the free space of each device,
3.1. if it is not zero, check the number of devices that have
more free space than this device,
if the number of devices is beyond the min stripe number, the free
space can be used, and is added to the total free space.
if the number of devices is below the min stripe number, we cannot
use the free space, and the check ends.
3.2. if the free space is zero, check the next device, goto 3.1
This implementation is just like a fake chunk allocation.
After applying this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 17:07:31 +07:00
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
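The "fake chunk allocation" steps listed in the free space commit message above can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the kernel routine: `usable_free_space` and `cmp_desc` are hypothetical names, every chunk is assumed to need exactly `stripes` devices, and the result is raw bytes across all devices.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: largest free space first. */
static int cmp_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * Hypothetical helper (not the kernel function): raw bytes that can still
 * be turned into chunks when every chunk needs `stripes` devices.
 * avail[] is modified in place.
 */
uint64_t usable_free_space(uint64_t *avail, size_t ndev, size_t stripes,
			   uint64_t stripe_len)
{
	uint64_t total = 0;
	size_t i;

	assert(stripes > 0 && stripe_len > 0);

	/* Step 1: align each device's free space down to the stripe length. */
	for (i = 0; i < ndev; i++)
		avail[i] -= avail[i] % stripe_len;

	/* Step 2: sort the devices by free space. */
	qsort(avail, ndev, sizeof(avail[0]), cmp_desc);

	/* Step 3: walk from the smallest device, faking chunk allocations. */
	while (ndev >= stripes) {
		uint64_t len = avail[ndev - 1];

		if (len > 0) {
			total += len * stripes;
			for (i = ndev - stripes; i < ndev; i++)
				avail[i] -= len;
		}
		ndev--;	/* the smallest device is now empty; drop it */
	}
	return total;
}
```

With two devices holding 5 and 10 units and a two-stripe profile, the sketch reports 10 usable raw units and leaves 5 units stranded on the larger device, which mirrors the df discrepancy in the message.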
|
2011-01-06 18:30:25 +07:00
|
|
|
int btrfs_error_unpin_extent_range(struct btrfs_root *root,
|
|
|
|
u64 start, u64 end);
|
2014-12-08 21:01:12 +07:00
|
|
|
int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
|
|
|
|
u64 num_bytes, u64 *actual_bytes);
|
2011-02-17 01:57:04 +07:00
|
|
|
int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 type);
|
2011-03-24 17:24:28 +07:00
|
|
|
int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
|
2011-01-06 18:30:25 +07:00
|
|
|
|
2011-03-07 09:13:14 +07:00
|
|
|
int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
|
2012-06-28 23:03:02 +07:00
|
|
|
int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_fs_info *fs_info);
|
2012-11-21 21:18:10 +07:00
|
|
|
int __get_raid_index(u64 flags);
|
Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
For example, if we perform the following file operations:
$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ xfs_io -f \
-c "pwrite -S 0xaa -b 32K 0 32K" \
-c "fsync" \
-c "pwrite -S 0xbb -b 32770 16K 32770" \
-c "truncate 90123" \
/mnt/foobar
and the snapshot creation ioctl was called just before the second write, we can
often get the following inode items in the snapshot's btree:
item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possible to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like the following:
"root 331 inode 257 errors 100, file extent discount"
From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:
1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-29 18:57:59 +07:00
|
|
|
int btrfs_start_write_no_snapshoting(struct btrfs_root *root);
|
|
|
|
void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
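The uncovered interval quoted in the snapshot commit message above can be checked with a few lines of arithmetic. This is a worked sketch that assumes a 4096-byte sector size, as the ALIGN expression in the message implies; `align_up`, `snapshot_gap` and `hole_extent_len` are hypothetical helper names.

```c
#include <stdint.h>

/* Round x up to the next multiple of a (a must be non-zero). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Range left without a file extent item when the second write is missed. */
struct byte_range snapshot_gap(uint64_t first_extent_end, uint64_t write_off,
			       uint64_t write_len, uint64_t sectorsize)
{
	struct byte_range gap = {
		first_extent_end,
		align_up(write_off + write_len, sectorsize),
	};

	return gap;
}

/* Length of the hole extent that the truncate inserts from hole_start. */
uint64_t hole_extent_len(uint64_t hole_start, uint64_t new_size,
			 uint64_t sectorsize)
{
	return align_up(new_size, sectorsize) - hole_start;
}
```

For the values in the message, the gap is [32768, 53248), and the hole extent that starts at 53248 is 40960 bytes long. Both numbers agree with the key offset and the `nr 40960` field of item 123 in the dump.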
|
Btrfs: fix -ENOSPC on block group removal
Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often led to an -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.
While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:
[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left
So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error out if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-20 20:01:55 +07:00
|
|
|
void check_system_chunk(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2015-06-09 23:48:21 +07:00
|
|
|
const u64 type);
|
2007-03-27 03:00:06 +07:00
|
|
|
/* ctree.c */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
|
|
|
|
int level, int *slot);
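The reference counting rule that the mixed back reference commit message describes for cow can be modelled with a toy structure. This is only a sketch of the rule as stated in the message, assuming a block holds a small array of child pointers; `toy_block` and `toy_cow` are hypothetical and unrelated to the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4

/* Toy tree block: a reference count plus pointers to lower-level blocks. */
struct toy_block {
	int refs;
	int freed;
	size_t nr_ptrs;
	struct toy_block *ptrs[MAX_PTRS];
};

/* Copy `old` into `new_blk` and adjust reference counts as the message says. */
void toy_cow(struct toy_block *old, struct toy_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1 && old->nr_ptrs <= MAX_PTRS);

	new_blk->refs = 1;
	new_blk->freed = 0;
	new_blk->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/* Non-shared: free the old block at once; the new block
		 * inherits its references, so lower levels are untouched. */
		old->refs = 0;
		old->freed = 1;
	} else {
		/* Shared: everything the new block points to gains a
		 * reference, and the old block loses one. */
		for (i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no lower-level count changes, which is the saving the message attributes to avoiding dead root records.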
|
|
|
|
int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
|
2008-03-25 02:01:56 +07:00
|
|
|
int btrfs_previous_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 min_objectid,
|
|
|
|
int type);
|
2014-01-12 20:38:33 +07:00
|
|
|
int btrfs_previous_extent_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 min_objectid);
|
2014-11-12 11:43:09 +07:00
|
|
|
void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_path *path,
|
2012-03-01 20:56:26 +07:00
|
|
|
struct btrfs_key *new_key);
|
2008-06-26 03:01:30 +07:00
|
|
|
struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
|
|
|
|
struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
|
2008-06-26 03:01:31 +07:00
|
|
|
int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
|
2008-06-26 03:01:31 +07:00
|
|
|
struct btrfs_key *key, int lowest_level,
|
2013-02-01 01:21:12 +07:00
|
|
|
u64 min_trans);
|
2008-06-26 03:01:31 +07:00
|
|
|
int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
|
2013-02-01 01:21:12 +07:00
|
|
|
struct btrfs_path *path,
|
2008-06-26 03:01:31 +07:00
|
|
|
u64 min_trans);
|
2012-06-06 02:07:48 +07:00
|
|
|
enum btrfs_compare_tree_result {
|
|
|
|
BTRFS_COMPARE_TREE_NEW,
|
|
|
|
BTRFS_COMPARE_TREE_DELETED,
|
|
|
|
BTRFS_COMPARE_TREE_CHANGED,
|
2013-08-17 03:52:55 +07:00
|
|
|
BTRFS_COMPARE_TREE_SAME,
|
2012-06-06 02:07:48 +07:00
|
|
|
};
|
|
|
|
typedef int (*btrfs_changed_cb_t)(struct btrfs_root *left_root,
|
|
|
|
struct btrfs_root *right_root,
|
|
|
|
struct btrfs_path *left_path,
|
|
|
|
struct btrfs_path *right_path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
enum btrfs_compare_tree_result result,
|
|
|
|
void *ctx);
|
|
|
|
int btrfs_compare_trees(struct btrfs_root *left_root,
|
|
|
|
struct btrfs_root *right_root,
|
|
|
|
btrfs_changed_cb_t cb, void *ctx);
|
2007-10-16 03:14:19 +07:00
|
|
|
int btrfs_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *parent, int parent_slot,
|
2009-03-13 21:24:59 +07:00
|
|
|
struct extent_buffer **cow_ret);
|
2007-12-18 08:14:01 +07:00
|
|
|
int btrfs_copy_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
|
|
|
struct extent_buffer **cow_ret, u64 new_root_objectid);
|
2009-06-10 21:45:14 +07:00
|
|
|
int btrfs_block_can_be_shared(struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf);
|
2013-04-16 12:18:49 +07:00
|
|
|
void btrfs_extend_item(struct btrfs_root *root, struct btrfs_path *path,
|
2012-03-01 20:56:26 +07:00
|
|
|
u32 data_size);
|
2013-04-16 12:18:22 +07:00
|
|
|
void btrfs_truncate_item(struct btrfs_root *root, struct btrfs_path *path,
|
2012-03-01 20:56:26 +07:00
|
|
|
u32 new_size, int from_end);
|
2008-12-10 21:10:46 +07:00
|
|
|
int btrfs_split_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *new_key,
|
|
|
|
unsigned long split_offset);
|
2009-11-12 16:33:58 +07:00
|
|
|
int btrfs_duplicate_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *new_key);
|
2013-11-05 10:33:33 +07:00
|
|
|
int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *path,
|
|
|
|
u64 inum, u64 ioff, u8 key_type, struct btrfs_key *found_key);
|
2007-03-17 03:20:31 +07:00
|
|
|
int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, struct btrfs_path *p, int
|
|
|
|
ins_len, int cow);
|
2012-05-16 23:25:47 +07:00
|
|
|
int btrfs_search_old_slot(struct btrfs_root *root, struct btrfs_key *key,
|
|
|
|
struct btrfs_path *p, u64 time_seq);
|
2011-09-13 16:18:10 +07:00
|
|
|
int btrfs_search_slot_for_read(struct btrfs_root *root,
|
|
|
|
struct btrfs_key *key, struct btrfs_path *p,
|
|
|
|
int find_higher, int return_any);
|
2007-08-08 03:15:09 +07:00
|
|
|
int btrfs_realloc_node(struct btrfs_trans_handle *trans,
|
2007-10-16 03:14:19 +07:00
|
|
|
struct btrfs_root *root, struct extent_buffer *parent,
|
2013-02-01 01:21:12 +07:00
|
|
|
int start_slot, u64 *last_ret,
|
2007-10-16 03:22:39 +07:00
|
|
|
struct btrfs_key *progress);
|
2011-04-21 06:20:15 +07:00
|
|
|
void btrfs_release_path(struct btrfs_path *p);
|
2007-04-02 21:50:19 +07:00
|
|
|
struct btrfs_path *btrfs_alloc_path(void);
|
|
|
|
void btrfs_free_path(struct btrfs_path *p);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 21:25:08 +07:00
|
|
|
void btrfs_set_path_blocking(struct btrfs_path *p);
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. The
other is used to manage the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one is used to manage the directory name
index items that are going to be inserted into the b+ tree, and the other is
used to manage the directory name index items that are going to be deleted
from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker handles
the delayed directory name index item insertions and deletions and the
delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes and insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not, we
add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 17:12:22 +07:00
|
|
|
void btrfs_clear_path_blocking(struct btrfs_path *p,
|
2011-07-17 02:23:14 +07:00
|
|
|
struct extent_buffer *held, int held_rw);
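The balance policy in the delayed inode commit message above amounts to a two-threshold decision. A minimal sketch follows, assuming that "beyond" means strictly greater than; the enumerators, `pick_balance_action`, and any threshold values are hypothetical, since the message does not give them.

```c
/* What the caller does after adding a delayed item (names are illustrative). */
enum balance_action {
	BALANCE_NONE,               /* below the lower limit: nothing to do */
	BALANCE_ASYNC_SOME,         /* queue work for some nodes, then return */
	BALANCE_ASYNC_ALL_AND_WAIT, /* queue work for all nodes, then wait */
};

/* Hypothetical helper: map the delayed item count to an action. */
enum balance_action pick_balance_action(unsigned long nr_items,
					unsigned long lower_limit,
					unsigned long upper_bound)
{
	if (nr_items > upper_bound)
		return BALANCE_ASYNC_ALL_AND_WAIT;
	if (nr_items > lower_limit)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

Only the upper-bound case blocks the caller, which is what lets ordinary insertions and deletions return quickly.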
|
2009-02-04 21:25:08 +07:00
|
|
|
void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
|
|
|
|
|
2008-01-30 03:11:36 +07:00
|
|
|
int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, int slot, int nr);
|
|
|
|
static inline int btrfs_del_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
return btrfs_del_items(trans, root, path, path->slots[0], 1);
|
|
|
|
}
|
|
|
|
|
2013-04-16 12:18:22 +07:00
|
|
|
void setup_items_for_insert(struct btrfs_root *root, struct btrfs_path *path,
|
2012-03-01 20:56:26 +07:00
|
|
|
struct btrfs_key *cpu_key, u32 *data_size,
|
|
|
|
u32 total_data, u32 total_size, int nr);
|
2007-03-17 03:20:31 +07:00
|
|
|
int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, void *data, u32 data_size);
|
2008-01-30 03:15:18 +07:00
|
|
|
int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *cpu_key, u32 *data_size, int nr);
|
|
|
|
|
|
|
|
static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
u32 data_size)
|
|
|
|
{
|
|
|
|
return btrfs_insert_empty_items(trans, root, path, key, &data_size, 1);
|
|
|
|
}
|
|
|
|
|
2007-03-13 21:46:10 +07:00
|
|
|
int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
|
2013-10-22 23:18:51 +07:00
|
|
|
int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
|
2012-06-11 13:29:29 +07:00
|
|
|
int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
u64 time_seq);
|
2012-06-19 20:42:25 +07:00
|
|
|
static inline int btrfs_next_old_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *p, u64 time_seq)
|
2011-11-22 21:14:33 +07:00
|
|
|
{
|
|
|
|
++p->slots[0];
|
|
|
|
if (p->slots[0] >= btrfs_header_nritems(p->nodes[0]))
|
2012-06-19 20:42:25 +07:00
|
|
|
return btrfs_next_old_leaf(root, p, time_seq);
|
2011-11-22 21:14:33 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2012-06-19 20:42:25 +07:00
|
|
|
static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
|
|
|
|
{
|
|
|
|
return btrfs_next_old_item(root, p, 0);
|
|
|
|
}
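The slot-advance pattern in `btrfs_next_old_item()` above (bump the slot, and move to the next leaf only when the slot runs past the leaf's item count) can be tried out on a toy linked list of leaves. This is a user-space sketch: the `toy_*` types and functions are hypothetical, leaves are assumed to be non-empty, and returning 1 from `toy_next_leaf` to mean "no more leaves" is an assumption of the sketch.

```c
#include <stddef.h>

/* Toy leaf: an item count and a link to the next leaf in key order. */
struct toy_leaf {
	int nritems;
	struct toy_leaf *next;
};

/* Toy path: the current leaf and the slot within it. */
struct toy_path {
	struct toy_leaf *leaf;
	int slot;
};

/* Move to the first slot of the next leaf; 1 means there is none. */
static int toy_next_leaf(struct toy_path *p)
{
	if (p->leaf->next == NULL)
		return 1;
	p->leaf = p->leaf->next;
	p->slot = 0;
	return 0;
}

/* Same shape as the inline helper: advance the slot, spill to the next leaf. */
int toy_next_item(struct toy_path *p)
{
	++p->slot;
	if (p->slot >= p->leaf->nritems)
		return toy_next_leaf(p);
	return 0;
}

/* Count the items from the current position to the end of the last leaf. */
int toy_count_from(struct toy_path *p)
{
	int n = 0;

	do {
		n++;
	} while (toy_next_item(p) == 0);
	return n;
}
```

A caller loops while the helper returns 0, so crossing a leaf boundary needs no special handling in the loop body.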
|
2007-10-16 03:14:19 +07:00
|
|
|
int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
|
2011-10-04 10:22:41 +07:00
|
|
|
int __must_check btrfs_drop_snapshot(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
int update_ref, int for_reloc);
|
2008-10-30 01:49:05 +07:00
|
|
|
int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *node,
|
|
|
|
struct extent_buffer *parent);
|
2011-05-31 23:07:27 +07:00
|
|
|
static inline int btrfs_fs_closing(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Get synced with close_ctree()
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
return fs_info->closing;
|
|
|
|
}
|
2013-05-14 17:20:43 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we remount the fs to be R/O or umount the fs, the cleaner needn't do
|
|
|
|
* anything except sleeping. This function is used to check the status of
|
|
|
|
* the fs.
|
|
|
|
*/
|
|
|
|
static inline int btrfs_need_cleaner_sleep(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
return (root->fs_info->sb->s_flags & MS_RDONLY ||
|
|
|
|
btrfs_fs_closing(root->fs_info));
|
|
|
|
}
|
|
|
|
|
2011-04-13 20:41:04 +07:00
|
|
|
static inline void free_fs_info(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2012-01-17 03:04:49 +07:00
|
|
|
kfree(fs_info->balance_ctl);
|
2011-04-13 20:41:04 +07:00
|
|
|
kfree(fs_info->delayed_root);
|
|
|
|
kfree(fs_info->extent_root);
|
|
|
|
kfree(fs_info->tree_root);
|
|
|
|
kfree(fs_info->chunk_root);
|
|
|
|
kfree(fs_info->dev_root);
|
|
|
|
kfree(fs_info->csum_root);
|
2011-09-13 20:23:30 +07:00
|
|
|
kfree(fs_info->quota_root);
|
2013-08-25 01:51:06 +07:00
|
|
|
kfree(fs_info->uuid_root);
|
2011-04-13 20:41:04 +07:00
|
|
|
kfree(fs_info->super_copy);
|
|
|
|
kfree(fs_info->super_for_commit);
|
2014-09-23 12:40:08 +07:00
|
|
|
security_free_mnt_opts(&fs_info->security_opts);
|
2011-04-13 20:41:04 +07:00
|
|
|
kfree(fs_info);
|
|
|
|
}
|
2011-05-31 23:07:27 +07:00
|
|
|
|
2012-06-21 16:08:04 +07:00
|
|
|
/* tree mod log functions from ctree.c */
|
|
|
|
u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
|
|
|
|
struct seq_list *elem);
|
|
|
|
void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
|
|
|
|
struct seq_list *elem);
|
2012-10-23 16:28:27 +07:00
|
|
|
int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq);
|
2012-06-21 16:08:04 +07:00
|
|
|
|
2007-03-27 03:00:06 +07:00
|
|
|
/* root-item.c */
|
2008-11-18 09:14:24 +07:00
|
|
|
int btrfs_find_root_ref(struct btrfs_root *tree_root,
|
2009-09-22 02:56:00 +07:00
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 root_id, u64 ref_id);
|
2008-11-18 08:37:39 +07:00
|
|
|
int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *tree_root,
|
2009-09-22 02:56:00 +07:00
|
|
|
u64 root_id, u64 ref_id, u64 dirid, u64 sequence,
|
|
|
|
const char *name, int name_len);
|
|
|
|
int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *tree_root,
|
|
|
|
u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
|
2008-11-18 08:37:39 +07:00
|
|
|
const char *name, int name_len);
|
2007-03-17 03:20:31 +07:00
|
|
|
int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct btrfs_key *key);
|
|
|
|
int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, struct btrfs_root_item
|
|
|
|
*item);
|
2011-10-04 10:22:44 +07:00
|
|
|
int __must_check btrfs_update_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
struct btrfs_root_item *item);
|
2013-05-15 14:48:19 +07:00
|
|
|
int btrfs_find_root(struct btrfs_root *root, struct btrfs_key *search_key,
|
|
|
|
struct btrfs_path *path, struct btrfs_root_item *root_item,
|
|
|
|
struct btrfs_key *root_key);
|
2009-09-22 03:00:26 +07:00
|
|
|
int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
|
2011-07-15 04:23:06 +07:00
|
|
|
void btrfs_set_root_node(struct btrfs_root_item *item,
|
|
|
|
struct extent_buffer *node);
|
2011-03-28 09:01:25 +07:00
|
|
|
void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
|
2012-07-25 22:35:53 +07:00
|
|
|
void btrfs_update_root_times(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2011-03-28 09:01:25 +07:00
|
|
|
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an expensive operation today. The
algorithm even has quadratic cost (in the number of existing
subvolumes), which means that it takes minutes to send/receive a
single subvolume if 10,000 subvolumes exist. Even linear cost would
be wasteful. Moreover, the data structures that map UUIDs to
subvolume IDs are rebuilt every time a btrfs send/receive instance
is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
This commit therefore adds kernel code that maintains data
structures in the filesystem which allow a given UUID to be looked
up quickly and the data assigned to it to be retrieved, such as the
subvolume ID related to the UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID; a
type/length/value scheme is used.
The lengthy justification for adding a new tree instead of using
the existing root tree follows:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 22:11:17 +07:00
|
|
|
/* uuid-tree.c */
|
|
|
|
int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *uuid_root, u8 *uuid, u8 type,
|
|
|
|
u64 subid);
|
|
|
|
int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *uuid_root, u8 *uuid, u8 type,
|
|
|
|
u64 subid);
|
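The UUID-tree commit message above says that the item key is the full UUID plus a key type. One plausible way to fit a 16-byte UUID into a three-field (objectid, type, offset) key is to split it into two little-endian 64-bit halves. The sketch below illustrates that idea only; `sketch_key`, `sketch_get_le64()` and `sketch_uuid_to_key()` are hypothetical names, and the exact packing is an assumption, not something the commit message spells out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a three-field tree key; not the kernel's struct. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Read 8 bytes as a little-endian 64-bit value, independent of host order. */
static uint64_t sketch_get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Assumed packing: first UUID half -> objectid, second half -> offset. */
static void sketch_uuid_to_key(const uint8_t uuid[16], uint8_t type,
			       struct sketch_key *key)
{
	key->objectid = sketch_get_le64(uuid);
	key->type = type;
	key->offset = sketch_get_le64(uuid + 8);
}
```

With a packing like this, all 128 bits of the UUID take part in the key, so a lookup by UUID becomes an ordinary exact-key tree search.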
2013-08-15 22:11:23 +07:00
|
|
|
int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info,
|
|
|
|
int (*check_func)(struct btrfs_fs_info *, u8 *, u8,
|
|
|
|
u64));
|
2013-08-15 22:11:17 +07:00
|
|
|
|
2007-03-27 03:00:06 +07:00
|
|
|
/* dir-item.c */
|
2012-12-18 02:26:57 +07:00
|
|
|
int btrfs_check_dir_item_collision(struct btrfs_root *root, u64 dir,
|
|
|
|
const char *name, int name_len);
|
2009-01-06 09:25:51 +07:00
|
|
|
int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, const char *name,
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes which are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one manages the directory name
index items which are going to be inserted into the b+ tree, and the other
manages the directory name index items which are going to be deleted from it.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index insertions and deletions
and the delayed inode updates.
When the number of delayed items is beyond the lower limit, we create works
for some delayed nodes, insert them into the work queue of the worker, and
then go back.
When the number of delayed items is beyond the upper bound, we create works
for all the delayed nodes that haven't been dealt with, insert them into the
work queue of the worker, and then wait until the number of untreated items
is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search
for it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- When we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 17:12:22 +07:00
|
|
|
int name_len, struct inode *dir,
|
2008-07-24 23:12:38 +07:00
|
|
|
struct btrfs_key *location, u8 type, u64 index);
|
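The delayed-inode commit message above describes a two-threshold balance policy: past a lower limit, work is queued for some delayed nodes and the caller returns; past an upper bound, work is queued for all remaining nodes and the caller waits. A minimal sketch of just that decision follows. The enum, the function name and the threshold parameters are hypothetical, and treating "beyond" as a strict greater-than comparison is an assumption.

```c
/* Possible outcomes of the balance check in this sketch. */
enum sketch_balance_action {
	SKETCH_BALANCE_NONE,      /* at or below the lower limit: nothing to do */
	SKETCH_BALANCE_ASYNC,     /* queue work for some nodes, then return */
	SKETCH_BALANCE_SYNC_WAIT  /* queue work for all nodes, then wait */
};

/*
 * Decide what to do given the current number of delayed items.
 * 'lower' and 'upper' are caller-chosen thresholds with lower <= upper.
 */
static enum sketch_balance_action
sketch_balance_policy(unsigned long nr_items, unsigned long lower,
		      unsigned long upper)
{
	if (nr_items > upper)
		return SKETCH_BALANCE_SYNC_WAIT;
	if (nr_items > lower)
		return SKETCH_BALANCE_ASYNC;
	return SKETCH_BALANCE_NONE;
}
```

Checking the upper bound first keeps the two ranges disjoint, so a caller never both returns early and waits.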
2007-04-20 02:36:27 +07:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
const char *name, int name_len,
|
|
|
|
int mod);
|
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_lookup_dir_index_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
u64 objectid, const char *name, int name_len,
|
|
|
|
int mod);
|
2009-09-22 02:56:00 +07:00
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_search_dir_index_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dirid,
|
|
|
|
const char *name, int name_len);
|
2007-04-20 02:36:27 +07:00
|
|
|
int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_dir_item *di);
|
2007-11-16 23:45:54 +07:00
|
|
|
int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
|
2009-11-12 16:35:27 +07:00
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
const void *data, u16 data_len);
|
2007-11-16 23:45:54 +07:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_xattr(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
int mod);
|
2011-03-17 03:47:17 +07:00
|
|
|
int verify_dir_item(struct btrfs_root *root,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_dir_item *dir_item);
|
2014-11-09 15:38:39 +07:00
|
|
|
struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
const char *name,
|
|
|
|
int name_len);
|
2008-07-24 23:17:14 +07:00
|
|
|
|
|
|
|
/* orphan.c */
|
|
|
|
int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
|
|
|
int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
2009-09-22 02:56:00 +07:00
|
|
|
int btrfs_find_orphan_item(struct btrfs_root *root, u64 offset);
|
2008-07-24 23:17:14 +07:00
|
|
|
|
2007-03-27 03:00:06 +07:00
|
|
|
/* inode-item.c */
|
2007-12-13 02:38:19 +07:00
|
|
|
int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
const char *name, int name_len,
|
2008-07-24 23:12:38 +07:00
|
|
|
u64 inode_objectid, u64 ref_objectid, u64 index);
|
2007-12-13 02:38:19 +07:00
|
|
|
int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
const char *name, int name_len,
|
2008-07-24 23:12:38 +07:00
|
|
|
u64 inode_objectid, u64 ref_objectid, u64 *index);
|
2007-10-16 03:14:19 +07:00
|
|
|
int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid);
|
2007-03-21 02:57:25 +07:00
|
|
|
int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root
|
2007-04-07 02:37:36 +07:00
|
|
|
*root, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *location, int mod);
|
2007-03-27 03:00:06 +07:00
|
|
|
|
2012-08-09 01:32:27 +07:00
|
|
|
struct btrfs_inode_extref *
|
|
|
|
btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
const char *name, int name_len,
|
|
|
|
u64 inode_objectid, u64 ref_objectid, int ins_len,
|
|
|
|
int cow);
|
|
|
|
|
|
|
|
int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
|
|
|
|
u64 ref_objectid, const char *name,
|
|
|
|
int name_len,
|
|
|
|
struct btrfs_inode_extref **extref_ret);
|
|
|
|
|
2007-03-27 03:00:06 +07:00
|
|
|
/* file-item.c */
|
2013-07-25 18:22:34 +07:00
|
|
|
struct btrfs_dio_private;
|
2008-12-10 21:10:46 +07:00
|
|
|
int btrfs_del_csums(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr, u64 len);
|
2008-08-01 02:42:53 +07:00
|
|
|
int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 04:58:54 +07:00
|
|
|
struct bio *bio, u32 *dst);
|
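The checksum commit message above says that checksum items are keyed by a fixed objectid with the key offset set to the starting byte of the extent. Assuming each item then holds one fixed-size checksum per data block, packed back to back, locating the checksum for a given disk byte is plain arithmetic. The sketch below shows that calculation; the function name, parameters and the packed layout are assumptions for illustration, not the kernel's implementation.

```c
#include <stdint.h>

/*
 * Return the byte offset, inside a checksum item's payload, of the checksum
 * covering disk byte 'bytenr'. Returns -1 if the item does not cover it.
 *
 * item_start: key offset of the item (first disk byte it covers)
 * item_size:  payload size of the item in bytes
 * blocksize:  bytes of data covered by one checksum
 * csum_size:  size of one checksum in bytes
 */
static int64_t sketch_csum_offset(uint64_t item_start, uint32_t item_size,
				  uint64_t bytenr, uint32_t blocksize,
				  uint32_t csum_size)
{
	uint64_t index;

	if (blocksize == 0 || csum_size == 0 || bytenr < item_start)
		return -1;

	/* Which block, counted from the start of the item, holds bytenr? */
	index = (bytenr - item_start) / blocksize;

	/* The item stores item_size / csum_size checksums in total. */
	if (index >= item_size / csum_size)
		return -1;

	return (int64_t)(index * csum_size);
}
```

Because the lookup depends only on disk byte numbers, it needs no inode or subvolume information, which matches the motivation given in the commit message.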
2010-05-23 22:00:55 +07:00
|
|
|
int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
|
2014-09-12 17:43:54 +07:00
|
|
|
struct bio *bio, u64 logical_offset);
|
2007-04-18 00:26:50 +07:00
|
|
|
int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 01:49:59 +07:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 pos,
|
|
|
|
u64 disk_offset, u64 disk_num_bytes,
|
|
|
|
u64 num_bytes, u64 offset, u64 ram_bytes,
|
|
|
|
u8 compression, u8 encryption, u16 other_encoding);
|
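The compression commit message above names two software-only limits (128k for the compressed size of an extent, 256k for its uncompressed size) and says that compression is abandoned when it fails to make the data smaller. A rough sketch of a check that combines those three conditions follows. The macro and function names are hypothetical, and folding the rules into a single predicate is an assumption; the message does not describe how the kernel applies them.

```c
#include <stddef.h>

/* Limits quoted in the commit message; software-only, not disk-format limits. */
#define SKETCH_MAX_COMPRESSED   (128 * 1024)
#define SKETCH_MAX_UNCOMPRESSED (256 * 1024)

/*
 * Return 1 if a compressed extent is worth keeping, 0 otherwise.
 * ram_bytes:  size of the data before compression
 * disk_bytes: size of the data after compression
 */
static int sketch_keep_compressed(size_t ram_bytes, size_t disk_bytes)
{
	if (ram_bytes > SKETCH_MAX_UNCOMPRESSED)
		return 0;
	if (disk_bytes > SKETCH_MAX_COMPRESSED)
		return 0;
	/* Compression that does not shrink the data is not worth keeping. */
	return disk_bytes < ram_bytes;
}
```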
2007-03-27 03:00:06 +07:00
|
|
|
int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
2007-10-16 03:15:53 +07:00
|
|
|
u64 bytenr, int mod);
|
2008-02-21 00:07:25 +07:00
|
|
|
int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
|
2008-12-09 04:58:54 +07:00
|
|
|
struct btrfs_root *root,
|
2008-07-17 23:53:50 +07:00
|
|
|
struct btrfs_ordered_sum *sums);
|
2008-07-18 17:17:13 +07:00
|
|
|
int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
|
2008-12-09 04:58:54 +07:00
|
|
|
struct bio *bio, u64 file_start, int contig);
|
2011-03-08 20:14:00 +07:00
|
|
|
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
|
|
|
|
struct list_head *list, int search_commit);
|
2014-06-09 09:48:05 +07:00
|
|
|
void btrfs_extent_item_to_extent_map(struct inode *inode,
|
|
|
|
const struct btrfs_path *path,
|
|
|
|
struct btrfs_file_extent_item *fi,
|
|
|
|
const bool new_inline,
|
|
|
|
struct extent_map *em);
|
|
|
|
|
2007-06-12 17:35:45 +07:00
|
|
|
/* inode.c */
|
2012-10-25 16:28:04 +07:00
|
|
|
struct btrfs_delalloc_work {
|
|
|
|
struct inode *inode;
|
|
|
|
int wait;
|
|
|
|
int delay_iput;
|
|
|
|
struct completion completion;
|
|
|
|
struct list_head list;
|
|
|
|
struct btrfs_work work;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
|
|
|
|
int wait, int delay_iput);
|
|
|
|
void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
|
|
|
|
|
2011-07-19 00:21:36 +07:00
|
|
|
struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page *page,
|
|
|
|
size_t pg_offset, u64 start, u64 len,
|
|
|
|
int create);
|
2013-08-15 01:02:47 +07:00
|
|
|
noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
|
2013-06-22 03:37:03 +07:00
|
|
|
u64 *orig_start, u64 *orig_block_len,
|
|
|
|
u64 *ram_bytes);
|
2008-07-24 20:51:08 +07:00
|
|
|
|
|
|
|
/* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
|
2008-08-07 22:19:42 +07:00
|
|
|
#if defined(ClearPageFsMisc) && !defined(ClearPageChecked)
|
2008-07-24 20:51:08 +07:00
|
|
|
#define ClearPageChecked ClearPageFsMisc
|
|
|
|
#define SetPageChecked SetPageFsMisc
|
|
|
|
#define PageChecked PageFsMisc
|
|
|
|
#endif
|
|
|
|
|
2011-07-20 10:46:35 +07:00
|
|
|
/* This forces readahead on a given range of bytes in an inode */
|
|
|
|
static inline void btrfs_force_ra(struct address_space *mapping,
|
|
|
|
struct file_ra_state *ra, struct file *file,
|
|
|
|
pgoff_t offset, unsigned long req_size)
|
|
|
|
{
|
|
|
|
page_cache_sync_readahead(mapping, ra, file, offset, req_size);
|
|
|
|
}
|
|
|
|
|
2008-11-18 09:02:50 +07:00
|
|
|
struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
|
|
|
|
int btrfs_set_inode_index(struct inode *dir, u64 *index);
|
2008-09-06 03:13:11 +07:00
|
|
|
int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *dir, struct inode *inode,
|
|
|
|
const char *name, int name_len);
|
|
|
|
int btrfs_add_link(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *parent_inode, struct inode *inode,
|
|
|
|
const char *name, int name_len, int add_backref, u64 index);
|
2009-09-22 02:56:00 +07:00
|
|
|
int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *dir, u64 objectid,
|
|
|
|
const char *name, int name_len);
|
2012-08-30 01:27:18 +07:00
|
|
|
int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
|
|
|
|
int front);
|
2008-09-06 03:13:11 +07:00
|
|
|
int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *inode, u64 new_size,
|
|
|
|
u32 min_type);
|
|
|
|
|
2009-11-12 16:36:34 +07:00
|
|
|
int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
|
2014-03-06 12:55:01 +07:00
|
|
|
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
|
|
|
|
int nr);
|
2010-02-04 02:33:23 +07:00
|
|
|
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
|
|
|
|
struct extent_state **cached_state);
|
2008-12-12 04:30:39 +07:00
|
|
|
int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
|
Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
Then unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
                    file creation time    ls -lha time
10 000 files                      3.49            0.76
100 000 files                    47.19            8.37
1 000 000 files                 518.51          107.06
* With 1 property (compression property set to lzo - test 2)
                    file creation time    ls -lha time
10 000 files                      3.63            0.93
100 000 files                    48.56            9.74
1 000 000 files                 537.72          125.11
* With 4 properties (test 3)
                    file creation time    ls -lha time
10 000 files                      3.94            1.20
100 000 files                    52.14           11.48
1 000 000 files                 572.70          142.13
* With 10 properties (test 4)
                    file creation time    ls -lha time
10 000 files                      4.61            1.35
100 000 files                    58.86           13.83
1 000 000 files                 656.01          177.61
The increased latencies with properties are essentially because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir index items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
    my $p = TEST_DIR . '/file_' . $i;
    open(my $f, '>', $p) or die "Error opening file!";
    $f->autoflush(1);
    for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
        print $f ('A' x 4096) or die "Error writing to file!";
    }
    close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-07 18:47:46 +07:00
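The 75-byte figure and the 18/26-byte savings quoted in the commit message above can be checked with a little arithmetic. The following is a userspace sketch, not kernel code: the packed structs are assumed mirrors of the on-disk key, leaf item header and btrfs_dir_item layouts, every name prefixed with model_ is hypothetical, and the property is assumed to be stored as an xattr named "btrfs.compression" with the value "lzo".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed on-disk key layout: objectid, type, offset (17 bytes packed). */
struct model_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* Assumed leaf item header: key plus data offset and size (25 bytes). */
struct model_item {
	struct model_disk_key key;
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

/* Assumed dir item layout, which xattr items reuse (30 bytes). */
struct model_dir_item {
	struct model_disk_key location;
	uint64_t transid;
	uint16_t data_len;
	uint16_t name_len;
	uint8_t type;
} __attribute__((packed));

/* Leaf bytes used by one xattr: item header + dir item + name + value. */
size_t model_xattr_leaf_bytes(const char *name, const char *value)
{
	return sizeof(struct model_item) + sizeof(struct model_dir_item) +
	       strlen(name) + strlen(value);
}

/* Bytes saved by a dedicated xattr item without 'location' and 'type'. */
size_t model_saving_location_type(void)
{
	return sizeof(struct model_disk_key) + sizeof(uint8_t);
}

/* Bytes saved if 'transid' were dropped as well. */
size_t model_saving_with_transid(void)
{
	return model_saving_location_type() + sizeof(uint64_t);
}
```

Under these assumptions, model_xattr_leaf_bytes("btrfs.compression", "lzo") adds up to the 75 bytes mentioned above, and the two saving helpers give 18 and 26 bytes. The packed attribute is a GCC/Clang extension; other compilers need their own packing directive.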
|
|
|
struct btrfs_root *new_root,
|
|
|
|
struct btrfs_root *parent_root,
|
|
|
|
u64 new_dirid);
|
2009-07-16 05:29:37 +07:00
|
|
|
int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
|
|
|
|
size_t size, struct bio *bio,
|
|
|
|
unsigned long bio_flags);
|
2009-04-01 05:23:21 +07:00
|
|
|
int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
|
2007-06-16 00:50:00 +07:00
|
|
|
int btrfs_readpage(struct file *file, struct page *page);
|
2010-06-07 22:35:40 +07:00
|
|
|
void btrfs_evict_inode(struct inode *inode);
|
2010-03-05 15:21:37 +07:00
|
|
|
int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc);
|
2007-06-12 17:35:45 +07:00
|
|
|
struct inode *btrfs_alloc_inode(struct super_block *sb);
|
|
|
|
void btrfs_destroy_inode(struct inode *inode);
|
2010-06-08 00:43:19 +07:00
|
|
|
int btrfs_drop_inode(struct inode *inode);
|
2007-06-12 17:35:45 +07:00
|
|
|
int btrfs_init_cachep(void);
|
|
|
|
void btrfs_destroy_cachep(void);
|
2008-06-10 21:07:39 +07:00
|
|
|
long btrfs_ioctl_trans_end(struct file *file);
|
2008-07-21 03:31:04 +07:00
|
|
|
struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
|
Btrfs: change how we mount subvolumes
This work is in preparation for being able to set a different root as the
default mounting root.
There is currently a problem with how we mount subvolumes. We cannot
mount a subvolume of a subvolume; you can only mount subvolumes/snapshots of the
default subvolume. So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
/
/snap1
/snap1/snap2
as your available volumes. Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2. To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list). This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.
In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to. For now this just
points at what has always been the default subvolume, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>. I tested this out with the above scenario and it
worked perfectly. Thanks,
mount -o subvol operates inside the selected subvolid. For example:
mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
/mnt will have the snap1 directory for the subvolume with id
256.
mount -o subvol=snap /dev/xxx /mnt
/mnt will be the snap directory of whatever the default subvolume
is.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-05 00:38:27 +07:00
|
|
|
struct btrfs_root *root, int *was_new);
|
2007-08-28 03:49:44 +07:00
|
|
|
struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
|
2011-04-19 19:29:38 +07:00
|
|
|
size_t pg_offset, u64 start, u64 end,
|
2007-08-28 03:49:44 +07:00
|
|
|
int create);
|
|
|
|
int btrfs_update_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *inode);
|
2012-10-23 02:43:12 +07:00
|
|
|
int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct inode *inode);
|
2008-09-26 21:05:38 +07:00
|
|
|
int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
|
2011-02-01 04:22:42 +07:00
|
|
|
int btrfs_orphan_cleanup(struct btrfs_root *root);
|
2010-05-16 21:49:58 +07:00
|
|
|
void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2011-02-01 03:30:16 +07:00
|
|
|
int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
|
2012-03-01 20:56:26 +07:00
|
|
|
void btrfs_invalidate_inodes(struct btrfs_root *root);
|
2009-11-12 16:36:34 +07:00
|
|
|
void btrfs_add_delayed_iput(struct inode *inode);
|
|
|
|
void btrfs_run_delayed_iputs(struct btrfs_root *root);
|
2010-05-16 21:49:59 +07:00
|
|
|
int btrfs_prealloc_file_range(struct inode *inode, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
2010-06-22 01:48:16 +07:00
|
|
|
int btrfs_prealloc_file_range_trans(struct inode *inode,
|
|
|
|
struct btrfs_trans_handle *trans, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we end up
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
find them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the latter checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-14 00:01:45 +07:00
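The check-and-clear behaviour described in the commit message above can be sketched in userspace. This is a single-threaded model under stated assumptions, not the kernel implementation: struct model_mapping stands in for the inode's i_mapping, the MODEL_AS_* constants stand in for AS_EIO and AS_ENOSPC, and plain bit operations replace the atomic bit helpers the kernel would use. All model_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the AS_EIO / AS_ENOSPC mapping flag bits. */
enum { MODEL_AS_EIO = 1, MODEL_AS_ENOSPC = 2 };

/* Hypothetical stand-in for the flags word of an inode's i_mapping. */
struct model_mapping {
	unsigned int flags;
};

/* Record a writeback failure, as the message says mapping_set_error does. */
void model_mapping_set_error(struct model_mapping *m, int error)
{
	if (error == -ENOSPC)
		m->flags |= MODEL_AS_ENOSPC;
	else if (error)
		m->flags |= MODEL_AS_EIO;
}

/* Check for and clear both flags; returns 0, -ENOSPC or -EIO. */
int model_inode_check_errors(struct model_mapping *m)
{
	int ret = 0;

	if (m->flags & MODEL_AS_ENOSPC) {
		m->flags &= ~(unsigned int)MODEL_AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & MODEL_AS_EIO) {
		m->flags &= ~(unsigned int)MODEL_AS_EIO;
		ret = -EIO;	/* in this model, EIO wins if both were set */
	}
	return ret;
}
```

Because the check also clears the flags, a second call after a reported error returns 0. This is the property that keeps an error from a fast fsync from being reported again by a later full fsync.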
|
|
|
int btrfs_inode_check_errors(struct inode *inode);
|
2009-10-09 20:54:36 +07:00
|
|
|
extern const struct dentry_operations btrfs_dentry_operations;
|
2015-03-17 04:38:52 +07:00
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
void btrfs_test_inode_set_ops(struct inode *inode);
|
|
|
|
#endif
|
2008-06-12 08:53:53 +07:00
|
|
|
|
|
|
|
/* ioctl.c */
|
|
|
|
long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
2009-04-17 15:37:41 +07:00
|
|
|
void btrfs_update_iflags(struct inode *inode);
|
|
|
|
void btrfs_inherit_iflags(struct inode *inode, struct inode *dir);
|
2013-08-15 22:11:20 +07:00
|
|
|
int btrfs_is_empty_uuid(u8 *uuid);
|
2011-05-25 02:35:30 +07:00
|
|
|
int btrfs_defrag_file(struct inode *inode, struct file *file,
|
|
|
|
struct btrfs_ioctl_defrag_range_args *range,
|
|
|
|
u64 newer_than, unsigned long max_pages);
|
2012-08-01 23:56:49 +07:00
|
|
|
void btrfs_get_block_group_info(struct list_head *groups_list,
|
|
|
|
struct btrfs_ioctl_space_info *space);
|
2013-08-14 23:12:25 +07:00
|
|
|
void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
|
|
|
|
struct btrfs_ioctl_balance_args *bargs);
|
|
|
|
|
2012-08-01 23:56:49 +07:00
|
|
|
|
2007-06-12 17:35:45 +07:00
|
|
|
/* file.c */
|
2012-11-26 16:24:43 +07:00
|
|
|
int btrfs_auto_defrag_init(void);
|
|
|
|
void btrfs_auto_defrag_exit(void);
|
2011-05-25 02:35:30 +07:00
|
|
|
int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode);
|
|
|
|
int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
|
2012-11-26 16:26:20 +07:00
|
|
|
void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
|
2011-07-17 07:44:56 +07:00
|
|
|
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
|
2012-08-31 07:06:49 +07:00
|
|
|
void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
|
|
|
|
int skip_pinned);
|
2009-10-02 05:43:56 +07:00
|
|
|
extern const struct file_operations btrfs_file_operations;
|
Btrfs: turbo charge fsync
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
                Original    Patched
SATA drive        82KB/s    140KB/s
Fusion drive     431KB/s   2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-18 00:14:17 +07:00
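The core idea in the commit message above (tag each extent with the transid that last modified it, then log only the extents touched in the current transaction, in sorted order) can be illustrated with a small userspace sketch. It is a model only: struct model_extent and both functions are hypothetical names, and sorting by start offset is an assumption, since the message does not say which key is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory extent tagged with its last modifying transid. */
struct model_extent {
	uint64_t start;
	uint64_t len;
	uint64_t generation;
};

/* qsort comparator: ascending by start offset, without overflow. */
int model_cmp_start(const void *a, const void *b)
{
	const struct model_extent *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy the extents modified in transaction 'transid' into 'out'
 * (which must have room for 'n' entries) and sort them by start.
 * Returns the number of extents selected.
 */
size_t model_collect_modified(const struct model_extent *all, size_t n,
			      uint64_t transid, struct model_extent *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (all[i].generation == transid)
			out[count++] = all[i];
	qsort(out, count, sizeof(*out), model_cmp_start);
	return count;
}
```

With a heavily fragmented file where only a few extents changed, the selected set stays small, which is where the speedup reported above would come from.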
|
|
|
int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct btrfs_path *path, u64 start, u64 end,
|
2014-01-07 18:42:27 +07:00
|
|
|
u64 *drop_end, int drop_cache,
|
|
|
|
int replace_extent,
|
|
|
|
u32 extent_item_size,
|
|
|
|
int *key_inserted);
|
Btrfs: turbo charge fsync
2012-08-18 00:14:17 +07:00
|
|
|
int btrfs_drop_extents(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct inode *inode, u64 start,
|
2012-08-29 23:24:27 +07:00
|
|
|
u64 end, int drop_cache);
|
2008-10-31 01:25:28 +07:00
|
|
|
int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, u64 start, u64 end);
|
2008-06-10 21:07:39 +07:00
|
|
|
int btrfs_release_file(struct inode *inode, struct file *file);
|
2011-04-07 00:05:22 +07:00
|
|
|
int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct page **pages, size_t num_pages,
|
|
|
|
loff_t pos, size_t write_bytes,
|
|
|
|
struct extent_state **cached);
|
2014-10-10 15:43:11 +07:00
|
|
|
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
|
2008-06-10 21:07:39 +07:00
|
|
|
|
2007-08-08 03:15:09 +07:00
|
|
|
/* tree-defrag.c */
|
|
|
|
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
|
2013-02-01 01:21:12 +07:00
|
|
|
struct btrfs_root *root);
|
2007-08-30 02:47:34 +07:00
|
|
|
|
|
|
|
/* sysfs.c */
|
|
|
|
int btrfs_init_sysfs(void);
|
|
|
|
void btrfs_exit_sysfs(void);
|
2015-08-14 17:32:46 +07:00
|
|
|
int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info);
|
2015-08-14 17:32:47 +07:00
|
|
|
void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
|
2007-08-30 02:47:34 +07:00
|
|
|
|
2007-11-16 23:45:54 +07:00
|
|
|
/* xattr.c */
|
|
|
|
ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
|
2008-07-24 23:16:03 +07:00
|
|
|
|
2007-12-22 04:27:24 +07:00
|
|
|
/* super.c */
|
2008-06-10 21:40:29 +07:00
|
|
|
int btrfs_parse_options(struct btrfs_root *root, char *options);
|
2008-06-10 21:07:39 +07:00
|
|
|
int btrfs_sync_fs(struct super_block *sb, int wait);
|
2012-07-31 04:40:13 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_PRINTK
|
|
|
|
__printf(2, 3)
|
2013-03-20 05:41:23 +07:00
|
|
|
void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...);
|
2012-07-31 04:40:13 +07:00
|
|
|
#else
|
|
|
|
static inline __printf(2, 3)
|
2013-03-20 05:41:23 +07:00
|
|
|
void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
|
2012-07-31 04:40:13 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-03-20 05:41:23 +07:00
|
|
|
#define btrfs_emerg(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_EMERG fmt, ##args)
|
|
|
|
#define btrfs_alert(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_ALERT fmt, ##args)
|
|
|
|
#define btrfs_crit(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_CRIT fmt, ##args)
|
|
|
|
#define btrfs_err(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_ERR fmt, ##args)
|
|
|
|
#define btrfs_warn(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_WARNING fmt, ##args)
|
|
|
|
#define btrfs_notice(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_NOTICE fmt, ##args)
|
|
|
|
#define btrfs_info(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_INFO fmt, ##args)
|
2013-11-13 07:22:53 +07:00
|
|
|
|
2015-10-08 13:48:52 +07:00
|
|
|
/*
|
|
|
|
* Wrappers that use printk_in_rcu
|
|
|
|
*/
|
|
|
|
#define btrfs_emerg_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_EMERG fmt, ##args)
|
|
|
|
#define btrfs_alert_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_ALERT fmt, ##args)
|
|
|
|
#define btrfs_crit_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_CRIT fmt, ##args)
|
|
|
|
#define btrfs_err_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_ERR fmt, ##args)
|
|
|
|
#define btrfs_warn_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_WARNING fmt, ##args)
|
|
|
|
#define btrfs_notice_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_NOTICE fmt, ##args)
|
|
|
|
#define btrfs_info_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_INFO fmt, ##args)
|
|
|
|
|
2015-10-08 15:27:02 +07:00
|
|
|
/*
|
|
|
|
* Wrappers that use a ratelimited printk_in_rcu
|
|
|
|
*/
|
|
|
|
#define btrfs_emerg_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_EMERG fmt, ##args)
|
|
|
|
#define btrfs_alert_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_ALERT fmt, ##args)
|
|
|
|
#define btrfs_crit_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_CRIT fmt, ##args)
|
|
|
|
#define btrfs_err_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_ERR fmt, ##args)
|
|
|
|
#define btrfs_warn_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_WARNING fmt, ##args)
|
|
|
|
#define btrfs_notice_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_NOTICE fmt, ##args)
|
|
|
|
#define btrfs_info_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_INFO fmt, ##args)
|
|
|
|
|
2015-10-08 15:51:11 +07:00
|
|
|
/*
|
|
|
|
* Wrappers that use a ratelimited printk
|
|
|
|
*/
|
|
|
|
#define btrfs_emerg_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_EMERG fmt, ##args)
|
|
|
|
#define btrfs_alert_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_ALERT fmt, ##args)
|
|
|
|
#define btrfs_crit_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_CRIT fmt, ##args)
|
|
|
|
#define btrfs_err_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_ERR fmt, ##args)
|
|
|
|
#define btrfs_warn_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_WARNING fmt, ##args)
|
|
|
|
#define btrfs_notice_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_NOTICE fmt, ##args)
|
|
|
|
#define btrfs_info_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_INFO fmt, ##args)
|
2013-11-13 07:22:53 +07:00
|
|
|
#ifdef DEBUG
|
2013-03-20 05:41:23 +07:00
|
|
|
#define btrfs_debug(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk(fs_info, KERN_DEBUG fmt, ##args)
|
2015-10-08 13:48:52 +07:00
|
|
|
#define btrfs_debug_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
|
2015-10-08 15:27:02 +07:00
|
|
|
#define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_rl_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
|
2015-10-08 15:51:11 +07:00
|
|
|
#define btrfs_debug_rl(fs_info, fmt, args...) \
|
|
|
|
btrfs_printk_ratelimited(fs_info, KERN_DEBUG fmt, ##args)
|
2013-11-13 07:22:53 +07:00
|
|
|
#else
|
|
|
|
#define btrfs_debug(fs_info, fmt, args...) \
|
|
|
|
no_printk(KERN_DEBUG fmt, ##args)
|
2015-10-08 13:48:52 +07:00
|
|
|
#define btrfs_debug_in_rcu(fs_info, fmt, args...) \
|
|
|
|
no_printk(KERN_DEBUG fmt, ##args)
|
2015-10-08 15:27:02 +07:00
|
|
|
#define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
no_printk(KERN_DEBUG fmt, ##args)
|
2015-10-08 15:51:11 +07:00
|
|
|
#define btrfs_debug_rl(fs_info, fmt, args...) \
|
|
|
|
no_printk(KERN_DEBUG fmt, ##args)
|
2013-11-13 07:22:53 +07:00
|
|
|
#endif
|
2013-03-20 05:41:23 +07:00
|
|
|
|
2015-10-08 13:48:52 +07:00
|
|
|
#define btrfs_printk_in_rcu(fs_info, fmt, args...) \
|
|
|
|
do { \
|
|
|
|
rcu_read_lock(); \
|
|
|
|
btrfs_printk(fs_info, fmt, ##args); \
|
|
|
|
rcu_read_unlock(); \
|
|
|
|
} while (0)
|
|
|
|
|
2015-10-08 15:27:02 +07:00
|
|
|
#define btrfs_printk_ratelimited(fs_info, fmt, args...) \
|
|
|
|
do { \
|
|
|
|
static DEFINE_RATELIMIT_STATE(_rs, \
|
|
|
|
DEFAULT_RATELIMIT_INTERVAL, \
|
|
|
|
DEFAULT_RATELIMIT_BURST); \
|
|
|
|
if (__ratelimit(&_rs)) \
|
|
|
|
btrfs_printk(fs_info, fmt, ##args); \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define btrfs_printk_rl_in_rcu(fs_info, fmt, args...) \
|
|
|
|
do { \
|
|
|
|
rcu_read_lock(); \
|
|
|
|
btrfs_printk_ratelimited(fs_info, fmt, ##args); \
|
|
|
|
rcu_read_unlock(); \
|
|
|
|
} while (0)
|
|
|
|
|
2013-08-27 03:53:15 +07:00
|
|
|
#ifdef CONFIG_BTRFS_ASSERT
|
|
|
|
|
2015-04-25 00:11:57 +07:00
|
|
|
__cold
|
2013-08-27 03:53:15 +07:00
|
|
|
static inline void assfail(char *expr, char *file, int line)
|
|
|
|
{
|
2013-12-20 23:37:06 +07:00
|
|
|
pr_err("BTRFS: assertion failed: %s, file: %s, line: %d\n",
|
2013-08-27 03:53:15 +07:00
|
|
|
expr, file, line);
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
|
|
|
#define ASSERT(expr) \
|
|
|
|
(likely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
|
|
|
|
#else
|
|
|
|
#define ASSERT(expr) ((void)0)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#define btrfs_assert()
|
2012-07-31 04:40:13 +07:00
|
|
|
__printf(5, 6)
|
2015-04-25 00:11:57 +07:00
|
|
|
__cold
|
2011-01-06 18:30:25 +07:00
|
|
|
void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
|
2012-03-01 20:57:30 +07:00
|
|
|
unsigned int line, int errno, const char *fmt, ...);
|
2011-01-06 18:30:25 +07:00
|
|
|
|
2015-06-15 20:41:19 +07:00
|
|
|
const char *btrfs_decode_error(int errno);
|
2012-07-31 04:40:13 +07:00
|
|
|
|
2015-04-25 00:11:57 +07:00
|
|
|
__cold
|
2012-03-01 23:24:58 +07:00
|
|
|
void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, const char *function,
|
|
|
|
unsigned int line, int errno);
|
|
|
|
|
2012-07-25 00:58:43 +07:00
|
|
|
#define btrfs_set_fs_incompat(__fs_info, opt) \
|
|
|
|
__btrfs_set_fs_incompat((__fs_info), BTRFS_FEATURE_INCOMPAT_##opt)
|
|
|
|
|
|
|
|
static inline void __btrfs_set_fs_incompat(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 flag)
|
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 features;
|
|
|
|
|
|
|
|
disk_super = fs_info->super_copy;
|
|
|
|
features = btrfs_super_incompat_flags(disk_super);
|
|
|
|
if (!(features & flag)) {
|
2013-04-11 17:30:16 +07:00
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
features = btrfs_super_incompat_flags(disk_super);
|
|
|
|
if (!(features & flag)) {
|
|
|
|
features |= flag;
|
|
|
|
btrfs_set_super_incompat_flags(disk_super, features);
|
2013-12-20 23:37:06 +07:00
|
|
|
btrfs_info(fs_info, "setting %llu feature flag",
|
2013-04-11 17:30:16 +07:00
|
|
|
flag);
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
2012-07-25 00:58:43 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-03-08 02:22:04 +07:00
|
|
|
#define btrfs_fs_incompat(fs_info, opt) \
|
|
|
|
__btrfs_fs_incompat((fs_info), BTRFS_FEATURE_INCOMPAT_##opt)
|
|
|
|
|
|
|
|
static inline int __btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
|
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
disk_super = fs_info->super_copy;
|
|
|
|
return !!(btrfs_super_incompat_flags(disk_super) & flag);
|
|
|
|
}
|
|
|
|
|
2012-09-18 20:52:32 +07:00
|
|
|
/*
|
|
|
|
* Call btrfs_abort_transaction as early as possible when an error condition is
|
|
|
|
* detected, that way the exact line number is reported.
|
|
|
|
*/
|
2012-03-01 23:24:58 +07:00
|
|
|
#define btrfs_abort_transaction(trans, root, errno) \
|
|
|
|
do { \
|
2015-04-25 00:11:54 +07:00
|
|
|
/* Report first abort since mount */ \
|
|
|
|
if (!test_and_set_bit(BTRFS_FS_STATE_TRANS_ABORTED, \
|
|
|
|
&((root)->fs_info->fs_state))) { \
|
|
|
|
WARN(1, KERN_DEBUG \
|
|
|
|
"BTRFS: Transaction aborted (error %d)\n", \
|
|
|
|
(errno)); \
|
|
|
|
} \
|
|
|
|
__btrfs_abort_transaction((trans), (root), __func__, \
|
|
|
|
__LINE__, (errno)); \
|
2012-03-01 23:24:58 +07:00
|
|
|
} while (0)
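Why the macro form matters for the "exact line number" remark in the comment above: __func__ and __LINE__ are expanded where the macro is used, so the out-of-line helper receives the caller's location and not its own. Below is a minimal userspace sketch of that pattern; all model_ names are hypothetical and the helper only records and prints the report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record of the most recent abort report. */
struct model_abort_report {
	const char *function;
	unsigned int line;
	int error;
};

struct model_abort_report model_last_abort;

/* Out-of-line helper: receives the call site's location as arguments. */
void model_abort_impl(const char *function, unsigned int line, int error)
{
	model_last_abort.function = function;
	model_last_abort.line = line;
	model_last_abort.error = error;
	fprintf(stderr, "abort in %s:%u (error %d)\n", function, line, error);
}

/* Expands at the call site, so __func__ and __LINE__ name the caller. */
#define model_abort(error) \
	model_abort_impl(__func__, __LINE__, (error))

/* Example caller; returns the line on which model_abort() is invoked. */
unsigned int model_failing_operation(void)
{
	unsigned int here = __LINE__ + 1;
	model_abort(-5);
	return here;
}
```

Calling the macro right where the error is detected, and not after unwinding to a common exit label, is what keeps the reported line close to the actual failure.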
|
2011-01-06 18:30:25 +07:00
|
|
|
|
2015-09-25 13:43:01 +07:00
|
|
|
#define btrfs_std_error(fs_info, errno, fmt, args...) \
|
2012-03-01 20:57:30 +07:00
|
|
|
do { \
|
|
|
|
__btrfs_std_error((fs_info), __func__, __LINE__, \
|
|
|
|
(errno), fmt, ##args); \
|
2011-01-06 18:30:25 +07:00
|
|
|
} while (0)
|
2008-07-24 23:16:36 +07:00
|
|
|
|
2012-07-31 04:40:13 +07:00
|
|
|
__printf(5, 6)
|
2015-04-25 00:11:57 +07:00
|
|
|
__cold
|
2011-10-04 10:22:31 +07:00
|
|
|
void __btrfs_panic(struct btrfs_fs_info *fs_info, const char *function,
|
|
|
|
unsigned int line, int errno, const char *fmt, ...);
|
|
|
|
|
2013-01-31 07:54:55 +07:00
|
|
|
/*
|
|
|
|
* If BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is in mount_opt, __btrfs_panic
|
|
|
|
* will panic(). Otherwise we BUG() here.
|
|
|
|
*/
|
2011-10-04 10:22:31 +07:00
|
|
|
#define btrfs_panic(fs_info, errno, fmt, args...) \
|
|
|
|
do { \
|
2013-01-31 07:54:55 +07:00
|
|
|
__btrfs_panic(fs_info, __func__, __LINE__, errno, fmt, ##args); \
|
|
|
|
BUG(); \
|
2011-01-06 18:30:25 +07:00
|
|
|
} while (0)
|
2008-07-24 23:16:36 +07:00
|
|
|
|
|
|
|
/* acl.c */
|
2009-10-14 00:50:18 +07:00
|
|
|
#ifdef CONFIG_BTRFS_FS_POSIX_ACL
|
2011-07-23 22:37:31 +07:00
|
|
|
struct posix_acl *btrfs_get_acl(struct inode *inode, int type);
|
2013-12-20 20:16:43 +07:00
|
|
|
int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type);
|
2009-11-12 16:35:27 +07:00
|
|
|
int btrfs_init_acl(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, struct inode *dir);
|
2011-07-14 10:17:39 +07:00
|
|
|
#else
|
2011-08-03 14:14:05 +07:00
|
|
|
#define btrfs_get_acl NULL
|
2013-12-20 20:16:43 +07:00
|
|
|
#define btrfs_set_acl NULL
|
2011-07-14 10:17:39 +07:00
|
|
|
static inline int btrfs_init_acl(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, struct inode *dir)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. There were some issues in volumes.c where we wouldn't allocate
the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. Although searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and makes for a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 00:14:11 +07:00
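The two-stage lookup described in item 1 of the commit message above (try at or after the hint offset first, then fall back to a size-based search) can be sketched as follows. This is only an illustration under stated assumptions: `struct free_extent` and the `find_*` helpers are hypothetical, a linear scan over an offset-sorted array stands in for the two rb-trees, and the fallback's "best fit, ties broken by distance to the hint" rule is one possible reading of the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One free-space extent inside a block group (hypothetical type). */
struct free_extent {
	uint64_t offset;
	uint64_t bytes;
};

static uint64_t distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/* Stage 1: first large-enough extent at or after the hint (array sorted by offset). */
static const struct free_extent *
find_by_offset(const struct free_extent *ext, size_t n,
	       uint64_t hint, uint64_t bytes)
{
	for (size_t i = 0; i < n; i++) {
		if (ext[i].offset >= hint && ext[i].bytes >= bytes)
			return &ext[i];
	}
	return NULL;
}

/* Stage 2: smallest extent that still fits; ties go to the one nearest the hint. */
static const struct free_extent *
find_by_size(const struct free_extent *ext, size_t n,
	     uint64_t hint, uint64_t bytes)
{
	const struct free_extent *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (ext[i].bytes < bytes)
			continue;
		if (!best || ext[i].bytes < best->bytes ||
		    (ext[i].bytes == best->bytes &&
		     distance(ext[i].offset, hint) < distance(best->offset, hint)))
			best = &ext[i];
	}
	return best;
}

/* Hint-first search with the size-based fallback; NULL means no extent fits. */
static const struct free_extent *
find_free_space(const struct free_extent *ext, size_t n,
		uint64_t hint, uint64_t bytes)
{
	const struct free_extent *found;

	assert(bytes > 0);
	found = find_by_offset(ext, n, hint, bytes);
	return found ? found : find_by_size(ext, n, hint, bytes);
}
```

Picking the smallest fitting extent in the fallback is what avoids "breaking small chunks off of huge areas"; a NULL result corresponds to the point where the allocator would have to try allocating more space.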
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 21:45:14 +07:00
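The reference-count rule that the commit message above gives for copy-on-write of tree blocks can be modelled in a few lines. This is a toy model under loud assumptions: `struct tree_block` and `cow_block` are hypothetical, blocks live in memory, and back ref items are not modelled at all; only the counting rule is shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4

/* Toy tree block: a reference count plus pointers to lower-level blocks. */
struct tree_block {
	int refs;
	size_t nr_ptrs;
	struct tree_block *ptrs[MAX_PTRS];
};

/*
 * COW 'old' into 'cow'.
 * - Non-shared (refs == 1): the old block is freed at once and the new
 *   block inherits its references, so lower-level counts are untouched.
 * - Shared (refs > 1): every extent the new block points to gains one
 *   reference and the old block loses one.
 */
static void cow_block(struct tree_block *old, struct tree_block *cow)
{
	assert(old->refs >= 1);

	cow->refs = 1;
	cow->nr_ptrs = old->nr_ptrs;
	for (size_t i = 0; i < old->nr_ptrs; i++)
		cow->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		old->refs = 0;	/* freed immediately, no dead root walk needed */
		return;
	}

	for (size_t i = 0; i < cow->nr_ptrs; i++)
		cow->ptrs[i]->refs++;
	old->refs--;
}
```

In the non-shared case no lower-level counter changes, which is the saving the message describes; the shared case is where full back references remain necessary.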
|
|
|
/* relocation.c */
|
|
|
|
int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
|
|
|
|
int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
int btrfs_recover_relocation(struct btrfs_root *root);
|
|
|
|
int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len);
|
2013-08-31 02:09:51 +07:00
|
|
|
int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *cow);
|
2015-08-06 19:58:11 +07:00
|
|
|
void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
|
2010-05-16 21:49:59 +07:00
|
|
|
u64 *bytes_to_reserve);
|
2012-03-01 23:24:58 +07:00
|
|
|
int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
|
2010-05-16 21:49:59 +07:00
|
|
|
struct btrfs_pending_snapshot *pending);
|
2011-03-08 20:14:00 +07:00
|
|
|
|
|
|
|
/* scrub.c */
|
2012-11-05 23:03:39 +07:00
|
|
|
int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
|
|
|
|
u64 end, struct btrfs_scrub_progress *progress,
|
2012-11-06 00:29:28 +07:00
|
|
|
int readonly, int is_dev_replace);
|
2012-03-01 20:56:26 +07:00
|
|
|
void btrfs_scrub_pause(struct btrfs_root *root);
|
|
|
|
void btrfs_scrub_continue(struct btrfs_root *root);
|
2012-11-05 23:03:39 +07:00
|
|
|
int btrfs_scrub_cancel(struct btrfs_fs_info *info);
|
|
|
|
int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
|
|
|
|
struct btrfs_device *dev);
|
2011-03-08 20:14:00 +07:00
|
|
|
int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
|
|
|
|
struct btrfs_scrub_progress *progress);
|
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace testing, we hit a null pointer dereference (it was very easy
to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information which includes the source device
before the mapping information update, and then submit a bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, so the above method prevents
new chunk allocation in between, and after we remove the source device, all
the new chunks will be allocated on the new device. So it fixes
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter. We not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace doesn't support raid56
yet, it is not addressed in the patch and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 15:46:55 +07:00
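The @bio_counter idea from the commit message above (count bios in flight, refuse new ones while the source device is being removed, and proceed only once the count has drained) can be sketched with C11 atomics. This is a simplified illustration, not the kernel code: `struct bio_gate` and its helpers are hypothetical, and a non-blocking "try" entry replaces the kernel's sleeping `btrfs_bio_counter_inc_blocked` so the sketch needs no threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct bio_gate {
	atomic_int_fast64_t in_flight; /* bios mapped or submitted, not yet ended */
	atomic_bool blocked;           /* set while the source device is removed */
};

static void bio_gate_init(struct bio_gate *g)
{
	atomic_init(&g->in_flight, 0);
	atomic_init(&g->blocked, false);
}

/*
 * Count a new bio unless the gate is blocked. The counter is bumped
 * before the flag is checked, so a remover that sets 'blocked' and then
 * reads the counter cannot miss a bio that slipped in concurrently.
 */
static bool bio_gate_try_enter(struct bio_gate *g)
{
	atomic_fetch_add(&g->in_flight, 1);
	if (atomic_load(&g->blocked)) {
		atomic_fetch_sub(&g->in_flight, 1);
		return false;
	}
	return true;
}

/* Drop 'amount' references when bios end. */
static void bio_gate_exit(struct bio_gate *g, int64_t amount)
{
	int_fast64_t old = atomic_fetch_sub(&g->in_flight, amount);

	assert(old >= amount);
}

static void bio_gate_block(struct bio_gate *g)
{
	atomic_store(&g->blocked, true);
}

static void bio_gate_unblock(struct bio_gate *g)
{
	atomic_store(&g->blocked, false);
}

/* True once no bio is in flight; the remover would wait for this. */
static bool bio_gate_drained(struct bio_gate *g)
{
	return atomic_load(&g->in_flight) == 0;
}
```

A remover would call `bio_gate_block`, wait until `bio_gate_drained` holds, swap the device, and then call `bio_gate_unblock`. The kernel version sleeps on a wait queue instead of returning false.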
|
|
|
|
|
|
|
/* dev-replace.c */
|
|
|
|
void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
|
|
|
|
void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
|
2014-11-25 15:39:28 +07:00
|
|
|
void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
|
|
|
|
|
|
|
|
static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
btrfs_bio_counter_sub(fs_info, 1);
|
|
|
|
}
|
2011-03-08 20:14:00 +07:00
|
|
|
|
btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its own read pointer and all disks will be utilized in parallel.
Also, no two disks will read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had an overly simple check to determine if a branch from
a node should be followed or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-23 19:33:49 +07:00
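The "fix missed branches" note in the commit message above concerns deciding whether a child pointer should be followed for a requested range [start, end). A minimal sketch of such a check follows, assuming each branch is described by its first key and an exclusive upper bound. `struct demo_key`, `demo_key_cmp` and `branch_in_range` are hypothetical names; the comparison order (objectid, then type, then offset) follows the usual btrfs key ordering.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPU-order key, reduced to the three fields that define the ordering. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns <0, 0 or >0, comparing objectid, then type, then offset. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * A branch covers [first, upper) and the request covers [start, end).
 * Testing only whether 'first' lies inside the request would skip a
 * branch that begins before 'start' but still contains it; comparing
 * against the upper bound catches that case.
 */
static bool branch_in_range(const struct demo_key *first,
			    const struct demo_key *upper,
			    const struct demo_key *start,
			    const struct demo_key *end)
{
	return demo_key_cmp(first, end) < 0 && demo_key_cmp(start, upper) < 0;
}
```

Because `end` is exclusive, matching the `key_end` field of `struct reada_control`, a branch whose first key equals `end` is not followed.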
|
|
|
/* reada.c */
|
|
|
|
struct reada_control {
|
|
|
|
struct btrfs_root *root; /* tree to prefetch */
|
|
|
|
struct btrfs_key key_start;
|
|
|
|
struct btrfs_key key_end; /* exclusive */
|
|
|
|
atomic_t elems;
|
|
|
|
struct kref refcnt;
|
|
|
|
wait_queue_head_t wait;
|
|
|
|
};
|
|
|
|
struct reada_control *btrfs_reada_add(struct btrfs_root *root,
|
|
|
|
struct btrfs_key *start, struct btrfs_key *end);
|
|
|
|
int btrfs_reada_wait(void *handle);
|
|
|
|
void btrfs_reada_detach(void *handle);
|
|
|
|
int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
|
|
|
|
u64 start, int err);
|
|
|
|
|
2012-05-29 22:06:54 +07:00
|
|
|
static inline int is_fstree(u64 rootid)
|
|
|
|
{
|
|
|
|
if (rootid == BTRFS_FS_TREE_OBJECTID ||
|
2015-02-27 15:24:23 +07:00
|
|
|
((s64)rootid >= (s64)BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
!btrfs_qgroup_level(rootid)))
|
2012-05-29 22:06:54 +07:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
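The signed casts in is_fstree matter because some internal tree IDs are defined as small negative numbers stored in a u64. A userspace sketch of the same test follows. The constants are assumptions copied from memory of the kernel headers (FS tree ID 5, first free object ID 256, relocation tree ID -8, qgroup level in the top 16 bits), and `demo_is_fstree` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check them against the kernel headers before relying on them. */
#define DEMO_FS_TREE_OBJECTID     5ULL
#define DEMO_FIRST_FREE_OBJECTID  256ULL
#define DEMO_TREE_RELOC_OBJECTID  ((uint64_t)-8)
#define DEMO_QGROUP_LEVEL_SHIFT   48

/* Upper bits of a qgroup ID encode its level; level 0 means a subvolume. */
static uint64_t demo_qgroup_level(uint64_t id)
{
	return id >> DEMO_QGROUP_LEVEL_SHIFT;
}

static bool demo_is_fstree(uint64_t rootid)
{
	if (rootid == DEMO_FS_TREE_OBJECTID)
		return true;
	/*
	 * Compared as unsigned, the relocation tree ID (-8) would look
	 * larger than 256; the signed comparison rejects it.
	 */
	return (int64_t)rootid >= (int64_t)DEMO_FIRST_FREE_OBJECTID &&
	       demo_qgroup_level(rootid) == 0;
}
```

With these assumed constants, IDs 5 and 256 are accepted, while 255, the relocation tree ID, and any ID with a nonzero qgroup level are rejected.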
|
2013-02-10 06:38:06 +07:00
|
|
|
|
|
|
|
static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
return signal_pending(current);
|
|
|
|
}
|
|
|
|
|
2013-10-12 01:44:09 +07:00
|
|
|
/* Sanity test specific functions */
|
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
void btrfs_test_destroy_inode(struct inode *inode);
|
|
|
|
#endif
|
2013-02-10 06:38:06 +07:00
|
|
|
|
2014-09-30 04:53:21 +07:00
|
|
|
static inline int btrfs_test_is_dummy_root(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
if (unlikely(test_bit(BTRFS_ROOT_DUMMY_ROOT, &root->state)))
|
|
|
|
return 1;
|
|
|
|
#endif
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-02-02 21:18:22 +07:00
|
|
|
#endif
|