Commit Graph

401 Commits

Author SHA1 Message Date
Yan Zheng
c146afad2c Btrfs: mount ro and remount support
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:34:12 -05:00
Chris Mason
6f3577bdc7 Btrfs: Improve metadata read latencies
This fixes latency problems on metadata reads by making sure they
don't go through the async submit queue, and by tuning down the amount
of readahead done during btree searches.

Also, the btrfs bdi congestion function is tuned to ignore the
number of pending async bios and checksums pending.  There is additional
code that throttles new async bios now and the congestion function
doesn't need to worry about it anymore.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:36 -05:00
Chris Mason
240d5d482b Btrfs: tune btrfs unplug functions for a small number of devices
When btrfs unplugs, it tries to find the correct device to unplug
via search through the extent_map tree.  This avoids unplugging
a device that doesn't need it, but is a waste of time for filesystems
with a small number of devices.

This patch checks the total number of devices before doing the
search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 13:08:31 -05:00
Chris Mason
771ed689d2 Btrfs: Optimize compressed writeback and reads
When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time.  Right now this is just compression.  The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits.  It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 22:02:51 -05:00
Chris Mason
4a69a41009 Btrfs: Add ordered async work queues
Btrfs uses kernel threads to create async work queues for cpu intensive
operations such as checksumming and decompression.  These work well,
but they make it difficult to keep IO order intact.

A single writepages call from pdflush or fsync will turn into a number
of bios, and each bio is checksummed in parallel.  Once the checksum is
computed, the bio is sent down to the disk, and since we don't control
the order in which the parallel operations happen, they might go down to
the disk in almost any order.

The code deals with this somewhat by having deep work queues for a single
kernel thread, making it very likely that a single thread will process all
the bios for a single inode.

This patch introduces an explicitly ordered work queue.  As work structs
are placed into the queue they are put onto the tail of a list.  They have
three callbacks:

->func (cpu intensive processing here)
->ordered_func (order sensitive processing here)
->ordered_free (free the work struct, all processing is done)

The work struct has three callbacks.  The func callback does the cpu intensive
work, and when it completes the work struct is marked as done.

Every time a work struct completes, the list is checked to see if the head
is marked as done.  If so the ordered_func callback is used to do the
order sensitive processing and the ordered_free callback is used to do
any cleanup.  Then we loop back and check the head of the list again.

This patch also changes the checksumming code to use the ordered workqueues.
One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 22:03:00 -05:00
Yan Zheng
84234f3a1f Btrfs: Add root tree pointer transaction ids
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Josef Bacik
2517920135 Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same.  Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one.  I have tested this heavily and it does
not appear to break anything.  This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Yan Zheng
f82d02d9d8 Btrfs: Improve space balancing code
This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Chris Mason
c8b978188c Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
Jim Meyering
83afeac42c Btrfs: disk-io.c (open_ctree): avoid leaks upon allocation failure
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-01 19:09:51 -04:00
Jim Meyering
0463bb4e8d Btrfs: disk-io.c (open_ctree): Don't deref. NULL upon failed kzalloc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-01 19:09:04 -04:00
Chris Mason
d352ac6814 Btrfs: add and improve comments
This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-29 15:18:18 -04:00
Chris Mason
8c8bee1d7c Btrfs: Wait for IO on the block device inodes of newly added devices
btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
The kernel code needs to wait for this IO to finish before it adds
the device.

btrfs metadata IO does not happen through the block device inode.  A
separate address space is used, allowing the zero filled buffer heads in
the block device inode to be written to disk after FS metadata starts
going down to the disk via the btrfs metadata inode.

The end result is zero filled metadata blocks after adding new devices
into the filesystem.

The fix is a simple filemap_write_and_wait on the block device inode
before actually inserting it into the pool of available devices.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-29 11:19:10 -04:00
Zheng Yan
1a40e23b95 Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code can keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
Zheng Yan
e465768938 Btrfs: Add shared reference cache
Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots.  To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.

This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, So
the old per-subvol reference cache is not well suited.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:04:53 -04:00
Chris Mason
24ab9cd85c Btrfs: Raise thresholds for metadata writeback
Btrfs metadata writeback is fairly expensive.  Once a tree block is written
it must be cowed before it can be changed again.  The btree writepages
code has a threshold based on a count of dirty btree bytes which is
updated as IO is sent out.

This changes btree_writepages to skip the writeout if there are less
than 32MB of dirty bytes from the btrees, improving performance
across many workloads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 15:41:59 -04:00
Chris Mason
2b1f55b0f0 Remove Btrfs compat code for older kernels
Btrfs had compatibility code for kernels back to 2.6.18.  These have
been removed, and will be maintained in a separate backport
git tree from now on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 15:41:59 -04:00
Zheng Yan
31840ae1a6 Btrfs: Full back reference support
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
ce3ed71a58 Btrfs: Checksum tree blocks in the background
Tree blocks were using async bio submission, but the sum was still
being done directly during writepage.  This moves the checksumming
into the worker thread.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik
0f9dd46cda Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size.  The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing.  If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible.  When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset.  also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated.  Basically there are 3 steps, first look from our
provided hint.  If we couldn't find from that given hint, start back at our
original search start and look for space from there.  If that fails try to
allocate space if we can and start looking again.  If not we're screwed and need
to start over again.

4) small fixes.  there were some issues in volumes.c where we wouldn't allocate
the rest of the disk.  fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space.  Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space.  Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups.  The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space.  This only happens with metadata
allocations, and only when we are almost full.  So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall.  I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
23a07867b7 Btrfs: Fix mismerge in block header checks
I had incorrectly disabled the check for the block number being correct
in the header block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
d0c803c404 Btrfs: Record dirty pages tree-log pages in an extent_io tree
This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk.  There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
d00aff0013 Btrfs: Optimize tree log block allocations
Since tree log blocks get freed every transaction, they never really
need to be written to disk.  This skips the step where we update
metadata to record they were allocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
3a5f1d458a Btrfs: Optimize btree walking while logging inodes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
98509cfc5a Btrfs: Fix releasepage to properly keep dirty and writeback pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
4bef084857 Btrfs: Tree logging fixes
* Pin down data blocks to prevent them from being reallocated like so:

trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file

Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.

With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent.  If we crash,
we're left with two different files using the extent.

* Don't wait in start_transaction if log replay is going on.  This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.

* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory

* Hold the buffer lock while we mark a buffer as written.  This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
e02119d5a7 Btrfs: Add a write ahead tree log to optimize synchronous operations
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree.  There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume.  See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
a1b32a5932 Btrfs: Add debugging checks to track down corrupted metadata
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
9473f16c75 Btrfs: Throttle for async bio submits higher up the chain
The current code waits for the count of async bio submits to get below
a given threshold if it is too high right after adding the latest bio
to the work queue.  This isn't optimal because the caller may have
sequential adjacent bios pending they are waiting to send down the pipe.

This changeset requires the caller to wait on the async bio count,
and changes the async checksumming submits to wait for async bios any
time they self throttle.

The end result is much higher sequential throughput.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
b64a2851ba Btrfs: Wait for async bio submissions to make some progress at queue time
Before, the btrfs bdi congestion function was used to test for too many
async bios.  This keeps that check to throttle pdflush, but also
adds a check while queuing bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
53863232ef Btrfs: Lower contention on the csum mutex
This takes the csum mutex deeper in the call chain and releases it
more often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
4854ddd0ed Btrfs: Wait for kernel threads to make progress during async submission
Before this change, btrfs would use a bdi congestion function to make
sure there weren't too many pending async checksum work items.

This change makes the process creating async work items wait instead,
leading to fewer congestion returns from the bdi.  This improves
pdflush background_writeout scanning.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
5443be45f5 Btrfs: Give all the worker threads descriptive names
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
777e6bd706 Btrfs: Transaction commit: don't use filemap_fdatawait
After writing out all the remaining btree blocks in the transaction,
the commit code would use filemap_fdatawait to make sure it was all
on disk.  This means it would wait for blocks written by other procs
as well.

The new code walks the list of blocks for this transaction again
and waits only for those required by this transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
0986fe9eac Btrfs: Count async bios separately from async checksum work items
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
b720d20952 Btrfs: Limit the number of async bio submission kthreads to the number of devices
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
4ca8b41e3f Btrfs: Avoid calling into the FS for the final iput on fake root inodes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
ea8c281947 Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
2dd3e67b1e Btrfs: More throttle tuning
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
  read anymore, thanks to the ref cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
61b4944018 Btrfs: Fix streaming read performance with checksumming on
Large streaming reads make for large bios, which means each entry on the
list async work queues represents a large amount of data.  IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.

The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.

This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of a per-page.  This results in many
extra btree lookups on large streaming reads.  Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan
bcc63abbf3 Btrfs: implement memory reclaim for leaf reference cache
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
33958dc6d3 Btrfs: Fix verify_parent_transid
It was incorrectly clearing the up to date flag on the buffer even
when the buffer properly verified.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
ab78c84de1 Btrfs: Throttle operations if the reference cache gets too large
A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
017e5369eb Btrfs: Leaf reference cache update
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng
31153d8128 Btrfs: Add a leaf reference cache
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik
7b12876623 Btrfs: Create orphan inode records to prevent lost files after a crash
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
3eaa288527 Btrfs: Fix the defragmention code and the block relocation code for data=ordered
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
89642229a5 Btrfs: Search data ordered extents first for checksums on read
Checksum items are not inserted into the tree until all of the io from a
given extent is complete.  This means one dirty page from an extent may
be written, freed, and then read again before the entire extent is on disk
and the checksum item is inserted.

The checksums themselves are stored in the ordered extent so they can
be inserted in bulk when IO is complete.  On read, if a checksum item isn't
found, the ordered extents were being searched for a checksum record.

This all worked most of the time, but the checksum insertion code tries
to reduce the number of tree operations by pre-inserting checksum items
based on i_size and a few other factors.  This means the read code might
find a checksum item that hasn't yet really been filled in.

This commit changes things to check the ordered extents first and only
dive into the btree if nothing was found.  This removes the need for
extra locking and is more reliable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
6af118ce51 Btrfs: Index extent buffers in an rbtree
Before, extent buffers were a temporary object, meant to map a number of pages
at once and collect operations on them.

But, a few extra fields have crept in, and they are also the best place to
store a per-tree block lock field as well.  This commit puts the extent
buffers into an rbtree, and ensures a single extent buffer for each
tree block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
f929574938 btrfs_start_transaction: wait for commits in progress to finish
btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
247e743cbe Btrfs: Use async helpers to deal with pages that have been improperly dirtied
Higher layers sometimes call set_page_dirty without asking the filesystem
to help.  This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly setup for IO and
kicks off an async helper to deal with them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
e6dcd2dc9c Btrfs: New data=ordered implementation
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated an inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, The previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
77a41afb7d Btrfs: Drop some verbose printks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
7d9eb12c87 Btrfs: Add locking around volume management (device add/remove/balance)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
3f157a2fd2 Btrfs: Online btree defragmentation fixes
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems.  The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
a74a4b97b6 Btrfs: Replace the transaction work queue with kthreads
This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
89ce8a63d0 Add btrfs_end_transaction_throttle to force writers to wait for pending commits
The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
333db94cdd Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.
This lowers the impact of snapshot deletion on the rest of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
051e1b9f74 Drop locks in btrfs_search_slot when reading a tree block.
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree.  This drops
locks held by a path when doing reads during a tree search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a213501153 Btrfs: Replace the big fs_mutex with a collection of other locks
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
925baeddc5 Btrfs: Start btree concurrency work.
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
1cc127b5d1 Btrfs: Add a thread pool just for submit_bio
If a bio submission is after a lock holder waiting for the bio
on the work queue, it is possible to deadlock.  Move the bios
into their own pool.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
4543df7ecc Btrfs: Add a mount option to control worker thread pool size
mount -o thread_pool_size changes the default, which is
min(num_cpus + 2, 8).  Larger thread pools would make more sense on
very large disk arrays.

This mount option controls the max size of each thread pool.  There
are multiple thread pools, so the total worker count will be larger
than the mount option.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
8b71284292 Btrfs: Add async worker threads for pre and post IO checksumming
Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system.  But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up.  The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines.  Two worker pools are created,
one for writes and one for endio processing.  The two could deadlock if
we tried to service both from a single pool.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Christoph Hellwig
edf24abe51 btrfs: sanity mount option parsing and early mount code
Also adds lots of comments to describe what's going on here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Jan Engelhardt
51ebc0d3d5 Btrfs: bdi_init and bdi_destroy come with 2.6.23
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
da496f2acf Btrfs: Always use the async submission queue for checksummed writes
This avoids IO stalls and poorly ordered IO from inline writers mixing in
with the async submission queue

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
1c8cfcc159 Btrfs: Enable btree balancing on old kernels again
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
cb03c743c6 Btrfs: Change the congestion functions to meter the number of async submits as well
The async submit workqueue was absorbing too many requests, leading to long
stalls where the async submitters were stalling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a0af469b58 Fix btrfs_open_devices to deal with changes since the scan ioctls
Devices can change after the scan ioctls are done, and btrfs_open_devices
needs to be able to verify them as they are opened and used by the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
dfe2502068 Btrfs: Add mount -o degraded to allow mounts to continue with missing devices
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
1259ab75c6 Btrfs: Handle write errors on raid1 and raid10
When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
ca7a79ad8d Btrfs: Pass down the expected generation number when reading tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
188de649c5 Btrfs: Don't do btree balance_dirty_pages on old kernels, it stalls forever
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a061fc8da7 Btrfs: Add support for online device removal
This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev.  This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface.  Things got ugly keeping the mapping constant.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
d6bfde8765 Btrfs: Fixes for 2.6.18 enterprise kernels
2.6.18 seems to get caught in an infinite loop when
cancel_rearming_delayed_workqueue is called more than once, so this switches
to cancel_delayed_work, which is arguably more correct.

Also, balance_dirty_pages can run into problems with 2.6.18 based kernels
because it doesn't have the per-bdi dirty limits.  This avoids calling
balance_dirty_pages on the btree inode unless there is actually something
to balance, which is a good optimization in general.

Finally there's a compile fix for ordered-data.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
a236aed14c Btrfs: Deal with failed writes in mirrored configurations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
4235298e4f Btrfs: Drop some verbose printks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
8f18cf1339 Btrfs: Make the resizer work based on shrinking and growing devices
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
84eed90fac Btrfs: Add failure handling for read_sys_array
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
bcbfce8abd Btrfs: Fix the unplug_io_fn to grab a consistent copy of page->mapping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
38b669880d Deal with page == NULL in the btrfs_unplug_io_fn
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
f2d8d74d78 Btrfs: Make an unplug function that doesn't unplug every spindle
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
4ef64eae28 Btrfs: Remove debugging statements from the invalidatepage calls
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
4575c9ccee Btrfs: Scale the bdi ra_pages by the number of devices in the FS
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
9ad6b7bc2e Force page->private removal in btrfs_invalidatepage
btrfs_invalidatepage is not allowed to leave pages around on the lru.
Any such pages will trigger an oops later on because the VM will see
page->private and assume it is a buffer head.

This also forces extra flushes of the async work queues before
dropping all the pages on the btree inode during unmount.  Left over
items on the work queues are one possible cause of busy state ranges
during truncate_inode_pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
0afbaf8c82 Btrfs: Set the btree inode i_size to OFFSET_MAX
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
7b13b7b119 Btrfs: Don't drop extent_map cache during releasepage on the btree inode
The btree inode should only have a single extent_map in the cache,
it doesn't make sense to ever drop it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
7b859fe7cd Btrfs: Only do async bio submission for pdflush
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
44b8bd7edd Btrfs: Create a work queue for bio writes
This allows checksumming to happen in parallel among many cpus, and
keeps us from bogging down pdflush with the checksumming code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
e17cade25f Btrfs: Add chunk uuids and update multi-device back references
Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed.  The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk.  Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
b248a41529 Btrfs: A few updates for 2.6.18 and versions older than 2.6.25
This includes fixing a missing spinlock init call that caused oops on mount
for most kernels other than 2.6.25.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Miguel
73f61b2a64 Btrfs: bio_endio support for linux 2.6.23 and older.
bio_endio() changed prototype on linux 2.6.24, support older kernels
using the older prototype.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Miguel
a5eb62e345 Btrfs: Endianess bug fix for v0.13 with kernels
Fix for a endianess BUG when using btrfs v0.13 with kernels older than 2.6.23

Problem:

Has of v0.13, btrfs-progs is using crc32c.c equivalent to the one found on
linux-2.6.23/lib/libcrc32c.c Since crc32c_le() changed in linux-2.6.23, when
running btrfs v0.13 with older kernels we have a missmatch between the versions
of crc32c_le() from btrfs-progs and libcrc32c in the kernel.  This missmatch
causes a bug when using btrfs on big endian machines.

Solution:
btrfs_crc32c() macro that when compiling for kernels older than 2.6.23, does
endianess conversion to parameters and return value of crc32c().
This endianess conversion nullifies the differences in implementation
of crc32c_le().
If kernel 2.6.23 or better, it calls crc32c().

Signed-off-by: Miguel Sousa Filipe <miguel.filipe@gmail.com>
---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
3dd39914bc Btrfs: Add extra checks to avoid removing extent_state from pages we can't free
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
f29844623d Btrfs: Write out all super blocks on commit, and bring back proper barrier support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
f188591e98 Btrfs: Retry metadata reads in the face of checksum failures
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
22c599485b Btrfs: Handle data block end_io through the async work queue
Before it was done by the bio end_io routine, the work queue code is able
to scale much better with faster IO subsystems.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
ce9adaa5a7 Btrfs: Do metadata checksums for reads via a workqueue
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read.  It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
728131d8e4 Btrfs: Add additional debugging for metadata checksum failures
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
d18a2c4475 Btrfs: Fix allocation profile init
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
611f0e00a2 Btrfs: Add support for duplicate blocks on a single spindle
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
8790d502e4 Btrfs: Add support for mirroring across drives
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0999df54f8 Btrfs: Verify checksums on tree blocks found without read_tree_block
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.

But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified.  This patch
makes sure all buffers go through checksum verification before they
are used.

This is safer, and it prevents modification of buffers before they go
through the csum code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Yan
e58ca0203d Fix btrfs_fill_super to return -EINVAL when no FS found
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
63b10fc487 Reorder the flags field in struct btrfs_header and record a flag on writeout
This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
2d2ae54797 Btrfs: Add leak debugging for extent_buffer and extent_state
This also fixes one leak around the super block when failing to mount the
FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
83041add61 Btrfs: Use a higher default ra pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0416008814 Create a btrfs backing dev info
This allows intelligent versions of unplug and congestion functions

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
593060d756 Btrfs: Implement raid0 when multiple devices are present
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
8a4b83cc8b Btrfs: Add support for device scanning and detection ioctls
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
239b14b32d Btrfs: Bring back mount -o ssd optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0d81ba5dbe Btrfs: Move device information into the super block so it can be scanned
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
6324fbf334 Btrfs: Dynamic chunk and block group allocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0b86a832a1 Btrfs: Add support for multiple devices per filesystem
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
d7fc640e6f Btrfs: Allocator improvements
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit

Fix find_free_extent to make better use of the last_alloc hint.  Before it
was often finding blocks just before the hint.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
a86c12c73d Btrfs: Create larger bios for btree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
4529ba495c Btrfs: Add data block hints to SSD mode too
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
b0c68f8bed Btrfs: Enable delalloc accounting
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
6f568d35a0 Btrfs: mount -o max_inline=size to control the maximum inline extent size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
70dec8079d Btrfs: extent_io and extent_state optimizations
The end_bio routines are changed to take a pointer to the extent state
struct, and the state tree is walked in order to set/clear appropriate
bits as IO completes.  This greatly reduces the number of rbtree searches
done by the end_bio handlers, and reduces lock contention.

The extent_io releasepage function is changed to avoid expensive searches
for locked state.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
d1310b2e0c Btrfs: Split the extent_map code into two parts
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
e18e4809b1 Btrfs: Add mount -o ssd, which includes optimizations for seek free storage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
04005cc7a1 Btrfs: Use blk_congestion_wait on older kernels
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
cee36a03e8 Rework btrfs_drop_inode to avoid scheduling
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
b99aa6cb54 Btrfs: Only delete roots from sysfs when they were added to sysfs
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
55c69072d6 Btrfs: Fix extent_buffer usage when nodesize != leafsize
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
21ad10cf3e Btrfs: Add flush barriers on commit
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
e2008b6140 Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletion
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
dc17ff8f11 Btrfs: Add data=ordered support
This forces file data extents down the disk along with the metadata that
references them.  The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
e4204dedbb Btrfs: Change tree block csum tagging to avoid false error messages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
4313b3994d Btrfs: Reduce stack usage in the resizer, fix 32 bit compiles
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
edbd8d4efe Btrfs: Support for online FS resize (grow and shrink)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
1832a6d5ee Btrfs: Implement basic support for -ENOSPC
This is intended to prevent accidentally filling the drive.  A determined
user can still make things oops.

It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
6da6abae02 Btrfs: Back port to 2.6.18-el kernels
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
c59f8951d4 Btrfs: Add mount option to enforce a max extent size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
d10c5f31c7 Btrfs: Fix extent_map and extent_state leaks by flushing lrus on FS unmount
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
b6cda9bcb4 Btrfs: Add mount -o nodatasum to turn of file data checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
d8d5f3e16d Btrfs: Add lowest key information to back refs for extent tree blocks as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
7bb86316c3 Btrfs: Add back pointers from extents to the btree or file referencing them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
87ee04eb0f Btrfs: Add simple stripe size parameter
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
ca6646264b Btrfs: Add efficient dirty accounting to the extent_map tree
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
448d640b66 Btrfs: Fine tune the btree writeback exclusion some more
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
8790bad7fb Btrfs: Only limit btree writeback for pdflush
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
793955bca6 Btrfs: Limit btree writeback to prevent seeks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan
324ae4df00 Btrfs: Add block group pinned accounting back
This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group',  the last parameter of the function indicates
pin vs unpin.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
0da5468f47 Btrfs: Use writepages for the metadata too
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
35ebb934bd Btrfs: Fix PAGE_CACHE_SHIFT shifts on 32 bit machines
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan Zheng
44b36eb20b Btrfs: Properly setup root key while reading the root
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
ff79f8190b Btrfs: Add back file data checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
304fced668 Btrfs: balance_dirty_pages_ratelimited is causing problems, use nr == 1 always
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
19c00ddcc3 Btrfs: Add back metadata checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
810191ff30 Btrfs: extent_map optimizations to cut down on CPU usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
4dc119046d Btrfs: Add an extent buffer LRU to reduce radix tree hits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
6b80053d02 Btrfs: Add back the online defragging code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
09e71a3263 Btrfs: Use an array of pages in the extent buffers to reduce the cost of find_get_page
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
db94535db7 Btrfs: Allow tree blocks larger than the page size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
1a5bc167f6 Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
96b5179d0d Btrfs: Stop using radix trees for the block group cache
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
f510cfecfc Btrfs: Fix extent_buffer and extent_state leaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
6d36dcd48f Btrfs: Avoid memcpy where possible in extent_buffers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
5f39d397df Btrfs: Create extent_buffer interface for large blocksizes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
d3c2fdcf7b Btrfs: Use balance_dirty_pages_nr on btree blocks
btrfs_btree_balance_dirty is changed to pass the number of pages dirtied
for more accurate dirty throttling.  This lets the VM make better decisions
about when to force some writeback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:00:48 -04:00
Chris Mason
5ce14bbcdd Btrfs: Find and remove dead roots the first time a root is loaded.
Dead roots are trees left over after a crash, and they were either in the
process of being removed or were waiting to be removed when the box crashed.
Before, a search of the entire tree of root pointers was done on mount
looking for dead roots.  Now, the search is done the first time we load
a root.

This makes mount faster when there are a large number of snapshots, and it
enables the block accounting code to properly update the block counts on
the latest root as old versions of the root are reaped after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-11 11:15:39 -04:00
Chris Mason
011410bd85 Btrfs: Add more synchronization before creating a snapshot
File data checksums are only done during writepage, so we have to make sure
all pages are written when the snapshot is taken.  This also adds some
locking so that new writes don't race in and add new dirty pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 19:58:36 -04:00
Josef Bacik
58176a9604 Btrfs: Add per-root block accounting and sysfs entries
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-29 15:47:34 -04:00
Josef Bacik
15ee9bc7ed Btrfs: delay commits during fsync to allow more writers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10 16:22:09 -04:00
Chris Mason
26b8003f10 Btrfs: Replace extent tree preallocation code with some bit radix magic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 20:17:12 -04:00
Chris Mason
6702ed490c Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 16:15:09 -04:00
Zach Brown
ec6b910fb3 Btrfs: trivial include fixups
Almost none of the files including module.h need to do so,
remove them.

Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-07-11 10:00:37 -04:00
Chris Mason
ccd467d60e Btrfs: crash recovery fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-28 15:57:36 -04:00
Chris Mason
4b52dff6d3 Btrfs: Fix super block updates during transaction commit
The super block written during commit was not consistent with the state of
the trees.  This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-26 10:06:50 -04:00
Chris Mason
5eda7b5e9b Btrfs: Add the ability to find and remove dead roots after a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22 14:16:25 -04:00
Chris Mason
54aa1f4dfd Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22 14:16:25 -04:00
Chris Mason
11bd143fc8 Btrfs: Switch to libcrc32c to avoid problems with cryptomgr on highmem machines
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22 14:16:24 -04:00
Chris Mason
8c2383c3dd Subject: Rework btrfs_file_write to only allocate while page locks are held
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-18 09:57:58 -04:00
Chris Mason
9ebefb180b Btrfs: patch queue: page_mkwrite
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-15 13:50:00 -04:00
Chris Mason
340887809d Btrfs: i386 fixes from axboe
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 11:36:58 -04:00
Chris Mason
6cbd557078 Btrfs: add GPLv2
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 09:07:21 -04:00
Chris Mason
5af3981c18 Btrfs: printk fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 07:50:13 -04:00
Chris Mason
39279cc3d2 Btrfs: split up super.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 06:35:45 -04:00
Chris Mason
0cf6c62017 Btrfs: remove device tree
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-09 09:22:25 -04:00
Chris Mason
facda1e787 Btrfs: get forced transaction commits via workqueue
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-08 18:11:48 -04:00
Chris Mason
08607c1b18 Btrfs: add compat ioctl
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-08 15:33:54 -04:00
Chris Mason
f9f3c6b666 Btrfs: 2.6.21-git fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-21 14:05:12 -04:00
Chris Mason
de428b63b1 Btrfs: allocator optimizations, truncate readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-18 13:28:27 -04:00
Chris Mason
509659cde5 Btrfs: switch to crc32c instead of sha256
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-10 12:36:17 -04:00
Chris Mason
e37c9e6921 Btrfs: many allocator fixes, pretty solid
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-09 20:13:14 -04:00
Chris Mason
be74417553 Btrfs: more allocator enhancements
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-06 10:15:01 -04:00
Chris Mason
35b7e47610 Btrfs: fix page cache memory leak
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-02 15:53:43 -04:00
Chris Mason
090d18753c Btrfs: directory readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-01 08:53:32 -04:00
Chris Mason
28b8bb9e00 Btrfs: allocator tweaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-27 11:42:05 -04:00
Chris Mason
cd1bc4653d Btrfs: more block allocator work
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-27 10:08:34 -04:00
Chris Mason
9078a3e1e4 Btrfs: start of block group code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-26 16:46:15 -04:00
Chris Mason
f2458e1d8c Btrfs: change around extent-tree prealloc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-25 15:52:25 -04:00
Chris Mason
4d77567309 Btrfs: add owner and type fields to the extents aand block headers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-20 20:23:12 -04:00
Chris Mason
e66f709b15 Btrfs: write barriers on commit, balance level before split
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-20 13:16:02 -04:00
Chris Mason
8fd17795b2 Btrfs: early fsync support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-19 21:01:03 -04:00
Chris Mason
236454dfff Btrfs: many file_write fixes, inline data
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-19 13:37:44 -04:00
Chris Mason
70b2befd0c Btrfs: rework csums and extent item ordering
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-17 15:39:32 -04:00