We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.
But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
rw_devices counter is often used to tune the profile when doing chunk allocation,
so we should modify it under the chunk_mutex context to avoid getting wrong
chunk profile.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For a missing device, we don't know it belong to which fs before we read its
fsid from the chunk tree. So we add them into the current fs device list at first.
When we get its fsid, we should move them to their own fs device list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The problem is:
Task0(device scan task) Task1(device replace task)
scan_one_device()
mutex_lock(&uuid_mutex)
device = find_device()
mutex_lock(&device_list_mutex)
lock_chunk()
rm_and_free_source_device
unlock_chunk()
mutex_unlock(&device_list_mutex)
check device
Destroying the target device if device replace fails also has the same problem.
We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.
It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can build a new filesystem based a seed filesystem, and we need clone
the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list which was being changed when we clone the fs devices.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
There were several problems about chunk mutex usage:
- Lock chunk mutex when updating metadata. It would cause the nested
deadlock because updating metadata might need allocate new chunks
that need acquire chunk mutex. We remove chunk mutex at this case,
because b-tree lock and other lock mechanism can help us.
- ABBA deadlock occured between device_list_mutex and chunk_mutex.
When we update device status, we must acquire device_list_mutex at the
beginning, and then we might get chunk_mutex during the device status
update because we need allocate new chunks for metadata COW. But at
most place, we acquire chunk_mutex at first and then acquire device list
mutex. We need change the lock order.
- Some place we needn't acquire chunk_mutex. For example we needn't get
chunk_mutex when we free a empty seed fs_devices structure.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we get the fs information, we forgot to acquire the mutex of device list,
it might cause the problem we might access a device that was removed. Fix
it by acquiring the device list mutex.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We didn't protect the system chunk array when we added a new
system chunk into it, it would cause the array be corrupted
if someone remove/add some system chunk into array at the same
time. Fix it by chunk lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should update free_chunk_space in time when we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space data to calculate the reserved
space, if we don't update it in time, we would consider the disk space which
has be allocated as free space, and would use it to do overcommit reservation.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should update device->bytes_used in the lock context of
chunk_mutex, or we would get wrong data.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
During removing a device, we have modified free_chunk_space when we
shrink the device, so we needn't assign a new value to it after
the device shrink. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.
Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
When replaying a directory from the fsync log, if a directory entry
exists both in the fs/subvol tree and in the log, the directory's inode
got its i_size updated incorrectly, accounting for the dentry's name
twice.
Reproducer, from a test for xfstests:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
sync
touch $SCRATCH_MNT/bar
xfs_io -c "fsync" $SCRATCH_MNT
xfs_io -c "fsync" $SCRATCH_MNT/bar
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
[ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
[ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"
_unmount_flakey
_check_scratch_fs $FLAKEY_DEV
The filesystem check at the end failed with the message:
"root 5 root dir 256 error".
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
One of my tests shows that when we really don't have space to reclaim via
flush_space and also run out of space, this async reclaim work loops on adding
itself into the workqueue and keeps writing something to disk according to
iostat's results, and these writes mainly comes from commit_transaction which
writes super_block. This's unacceptable as it can be bad to disks, especially
memeory storages.
This adds a check to avoid the above situation.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared. This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you. So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have. This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The behaviour of a 'chattr -c' consists of getting the current flags,
clearing the FS_COMPR_FL bit and then sending the result to the set
flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
passed to the ioctl. This results in the compression property not being
cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
was set in the received flags.
Reproducer:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt && cd /mnt
$ mkdir a
$ chattr +c a
$ touch a/file
$ lsattr a/file
--------c------- a/file
$ chattr -c a
$ touch a/file2
$ lsattr a/file2
--------c------- a/file2
$ lsattr -d a
---------------- a
Reported-by: Andreas Schneider <asn@cryptomilk.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs-transacion:5657
[stack snip]
btrfs_bio_map()
btrfs_bio_counter_inc_blocked()
percpu_counter_inc(&fs_info->bio_counter) ###bio_counter > 0(A)
__btrfs_bio_map()
btrfs_dev_replace_lock()
mutex_lock(dev_replace->lock) ###wait mutex(B)
btrfs:32612
[stack snip]
btrfs_dev_replace_start()
btrfs_dev_replace_lock()
mutex_lock(dev_replace->lock) ###hold mutex(B)
btrfs_dev_replace_finishing()
btrfs_rm_dev_replace_blocked()
wait until percpu_counter_sum == 0 ###wait on bio_counter(A)
This bug can be triggered quite easily by the following test script:
http://pastebin.com/MQmb37Cy
This patch will fix the ABBA problem by calling
btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked().
The consistency of btrfs devices list and their superblocks is protected
by device_list_mutex, not btrfs_dev_replace_lock/unlock().
So it is safe the move btrfs_dev_replace_unlock() before
btrfs_rm_dev_replace_blocked().
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <clm@fb.com>
We've defined a 'offset' out of bio_for_each_segment_all.
This is just a clean rename, no function changes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_drop_snapshot() leaves subvolume qgroup items on disk after
completion. This can cause problems with snapshot creation. If a new
snapshot tries to claim the deleted subvolumes id, btrfs will get -EEXIST
from add_qgroup_item() and go read-only. The following commands will
reproduce this problem (assume btrfs is on /dev/sda and is mounted at
/btrfs)
mkfs.btrfs -f /dev/sda
mount -t btrfs /dev/sda /btrfs/
btrfs quota enable /btrfs/
btrfs su sna /btrfs/ /btrfs/snap
btrfs su de /btrfs/snap
sleep 45
umount /btrfs/
mount -t btrfs /dev/sda /btrfs/
We can fix this by catching -EEXIST in add_qgroup_item() and
initializing the existing items. We have the problem of orphaned
relation items being on disk from an old snapshot but that is outside
the scope of this patch.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
The map_start and map_len fields aren't used anywhere, so just remove
them. On a x86_64 system, this reduced sizeof(struct extent_buffer)
from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
now fit into a page instead of 13.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Maximum xattr size can be up to nearly the leaf size. For an fs with a
leaf size larger than the page size, using kmalloc requires allocating
multiple pages that are contiguous, which might not be possible if
there's heavy memory fragmentation. Therefore fallback to vmalloc if
we fail to allocate with kmalloc. Also start with a smaller buffer size,
since xattr values typically are smaller than a page.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Last user removed in commit "btrfs: disable strict file flushes for
renames and truncates" (8d875f95da).
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
While under random IO, a block group's free space cache eventually reaches
a state where it has a mix of extent entries and bitmap entries representing
free space regions.
As later free space regions are returned to the cache, some of them are merged
with existing extent entries if they are contiguous with them. But others are
not merged, because despite the existence of adjacent free space regions in
the cache, the merging doesn't happen because the existing free space regions
are represented in bitmap extents. Even when new free space regions are merged
with existing extent entries (enlarging the free space range they represent),
we create chances of having after an enlarged region that is contiguous with
some other region represented in a bitmap entry.
Both clustered and non-clustered space allocation work by iterating over our
extent and bitmap entries and skipping any that represents a region smaller
then the allocation request (and giving preference to extent entries before
bitmap entries). By having a contiguous free space region that is represented
by 2 (or more) entries (mix of extent and bitmap entries), we end up not
satisfying an allocation request with a size larger than the size of any of
the entries but no larger than the sum of their sizes. Making the caller assume
we're under a ENOSPC condition or force it to allocate multiple smaller space
regions (as we do for file data writes), which adds extra overhead and more
chances of causing fragmentation due to the smaller regions being all spread
apart from each other (more likely when under concurrency).
For example, if we have the following in the cache:
* extent entry representing free space range: [128Mb - 256Kb, 128Mb[
* bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
space in this 128Mb area is marked as free
An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb,
would fail before, despite the existence of such contiguous free space area in the
cache. The caller could only allocate up to 768Kb of space at once and later another
256Kb (or vice-versa). In between each smaller allocation request, another task
working on a different file/inode might come in and take that space, preventing the
former task of getting a contiguous 1Mb region of free space.
Therefore this change implements the ability to move free space from bitmap
entries into existing and new free space regions represented with extent
entries. This is done when a space region is added to the cache.
A test was added to the sanity tests that explains in detail the issue too.
Some performance test results with compilebench on a 4 cores machine, with
32Gb of ram and using an HDD follow.
Test: compilebench -D /mnt -i 30 -r 1000 --makej
Before this change:
intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)
After this change:
intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
we are assigning number_devices to the total_bytes,
that's very confusing for a moment
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
there is no matching open parenthesis for the closing parenthesis
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
seed fs devices don't participate as rw_device, so don't increment
rw_devices when the device being handled belongs to a seed fs.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we replace all the seed device in the system there is
no point in just keeping the btrfs_fs_devices with out
any device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We are not updating sprout fs seed pointer when all seed device
is replaced. This patch will check if all seed device has been
replaced and then update the sprout pointer accordingly.
Same reproducer as in the previous patch would apply here.
And notice that btrfs_close_device will check if seed fs is
present and spits out the error with out this patch.
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
{
::
seed_devices = fs_devices->seed;
::
while (seed_devices) {
fs_devices = seed_devices;
seed_devices = fs_devices->seed;
__btrfs_close_devices(fs_devices);
free_fs_devices(fs_devices);
}
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
reproducer:
reproducer:
mount /dev/sdb /btrfs
btrfs dev add /dev/sdc /btrfs
btrfs rep start -B /dev/sdb /dev/sdd /btrfs
umount /btrfs
WARNING: CPU: 0 PID: 3882 at fs/btrfs/volumes.c:892 __btrfs_close_devices+0x1c8/0x200 [btrfs]()
which is
WARN_ON(fs_devices->rw_devices);
The problem here is that we did not add one to the rw_devices when
we replace the seed device with a writable device.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
reproducer:
mount /dev/sdb /btrfs
btrfs dev add /dev/sdc /btrfs
btrfs rep start -B /dev/sdb /dev/sdd /btrfs
umount /btrfs
WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]()
::
__btrfs_close_devices()
::
WARN_ON(fs_devices->open_devices);
After the seed device has been replaced the new target device
is no more a seed device. So we need to update the device
numbers in the fs_devices as pointed by the fs_info.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
There is no logical change in this patch, just a preparatory patch,
so that changes can be easily reasoned.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The issue was introduced in a79b7d4b3e,
adding allocation of extent_workers, so this stray check is surely not
meant to be a check of something else.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021
Reported-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.
This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
BTRFS_ATTR_RW could set the mode and be inline with BTRFS_ATTR
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
All that uses BTRFS_ATTR want mode to be set at 0444 so just do
it at the define. And few spacing alignments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we need to cow a node, increase the write lock level and retry the
tree search, there's no point of changing the node locks in our path
to blocking mode, as we only waste time and unnecessarily wake up other
tasks waiting on the spinning locks (just to block them again shortly
after) because we release our path before repeating the tree search.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In ctree.c:setup_items_for_insert(), we can unlock all nodes in our
path before we process the leaf (shift items and data, adjust data
offsets, etc). This allows for better btree concurrency, as we're
often holding a write lock on at least the node at level 1.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_lookup_csums_range() uses ALIGN() to check if "start"
and "end + 1" are aligned to "root->sectorsize". It's better to
replace these with IS_ALIGNED() for simplicity.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We want this to debug qgroup changes on live systems.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The member variants - latest_devid and latest_trans - of fs_devices structure
are set, but no one use them to do anything. so remove them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The io error might happen during writing out the device stats, and the
device stats information and dirty flag would be update at that time,
but the current code didn't consider this case, just clear the dirty
flag, it would cause that we forgot to write out the new device stats
information. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The lock in btrfs_device structure was far away from its protected data, it would
make CPU load the cache line twice when we accessed them, move them together.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The super block generation of the seed devices is not the same as the
filesystem which sprouted from them because we don't update the super
block on the seed devices when we change that new filesystem. So we
should not use the generation of that new filesystem to check the super
block generation on the seed devices, Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
All the metadata in the seed devices has the same fsid as the fsid
of the seed filesystem which is on the seed device, so we should check
them by the current filesystem. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The transaction thread may want to do more work, namely it pokes the
cleaner ktread that will start processing uncleaned subvols.
This can be triggered by user via the 'btrfs fi sync' command, otherwise
there was a delay up to 30 seconds before the cleaner started to clean
old snapshots.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
inline data is stored from offset of @disk_bytenr in
struct btrfs_file_extent_item. So substracting total
size of struct btrfs_file_extent_item is wrong, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs could still inline file data if its size is same as
page size, so don't skip max value here.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If flag NOCOMPRESS is set which means bad compression ratio,
we could avoid call cow_file_range_async() for this case earlier.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If a file's compression ratios is bad, we will set NOCOMPRESS
flag for it, and it will skip compression for that inode next time.
However, if we remount fs to COMPRESS_FORCE, it still should try
if we could compress pages for that inode, this patch fix wrong
check for this problem.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Fix the following sparse warning:
fs/btrfs/send.c:518:51: warning: incorrect type in argument 2 (different address spaces)
fs/btrfs/send.c:518:51: expected char const [noderef] <asn:1>*<noident>
fs/btrfs/send.c:518:51: got char *
We can safely use (const char __user *) with set_fs(KERNEL_DS)
__force added to avoid sparse-all warning:
fs/btrfs/send.c:518:40: warning: cast adds address space to expression (<asn:1>)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Chris Mason <clm@fb.com>
Use BUG_ON(x) rather than if(x) BUG();
The semantic patch that fixes this problem is as follows:
// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
`struct workspace' used for zlib compression contains two zlib
z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm'
used in zlib_decompress/zlib_decompress_biovec(). None of these
functions use `inf_strm' and `def_strm' simultaniously, meaning that
for every compress/decompress operation we need only one z_stream
(out of two available).
`inf_strm' and `def_strm' are different in size of ->workspace. For
inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for
deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib
returns the following workspace sizes, correspondingly: 42312 and 268104
(+ guard pages).
Keep only one `z_stream' in `struct workspace' and use it for both
compression and decompression. Hence, instead of vmalloc() of two
z_stream->worskpace-s, allocate only one of size:
max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize())
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.
On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
replace IS_ERR/PTR_ERR
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.
Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.
Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we mounted a seed filesystem with degraded option, and then added a new
device into the seed filesystem, then we found adding device failed because
of the IO failure.
Steps to reproduce:
# mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1>
# btrfstune -S 1 <dev0>
# mount <dev0> -o degraded <mnt>
# btrfs device add -f <dev2> <mnt>
It is because the original didn't set the chunk on the seed device to be
read-only if the degraded flag was set. It was introduced by patch f48b90756,
which fixed the problem the raid1 filesystem became read-only after one device
of it was missing. But this fix method was not right, we should set the read-only
flag according to the number of the missing devices, not the degraded mount
option, if the number of the missing devices is less than the max error number
that the profile of the chunk tolerates, we don't set it to be read-only.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs defragment will utilize COW feature, which means this
did not work for nodatacow option, this problem was detected
by xfstests generic/018 with nodatacow mount option.
Fix this problem by forcing cow for a extent with state
@EXTETN_DEFRAG setting.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
as in the disk add patch, disk detached from the volume must be
recorded in the syslog as well for the same reason.
Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
when we add a new disk to the mounted btrfs we don't record it
as of now, disk add is a critical change of btrfs configuration,
it must be recorded in the syslog to help offline investigations
of customer problems when reported.
Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sdb
# mount /dev/sdb /mnt -o compress-force=lzo
# mount /dev/sdb /mnt -o remount,compress=zlib
# cat /proc/mounts
Remounting from compress-force to compress could not clear compress-force
option. The problem is there is no way for users to clear compress-force
option separately.
Fix this problem by clearing @FORCE_COMPRESS flag when remounting to
compress=xxx.
Suggested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The form
(value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT
is equivalent to
(value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE
The rest is a simple subsitution, no difference in the generated
assembly code.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.
Shaves a few bytes from .text:
text data bss dec hex filename
852418 24560 23112 900090 dbbfa btrfs.ko.before
851074 24584 23112 898770 db6d2 btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
There's no user of the return value and we can get rid of the comment in
put_super.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The grace period is ended in two steps--first userland is notified that
the grace period is now long enough that any clients who have not yet
reclaimed can be safely forgotten, then we flip the switch that forbids
reclaims and allows new opens. I had to think a bit to convince myself
that the ordering was right here. Document it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The attempt to automatically set a new grace period time at the end of
the grace period isn't really helpful. We'll probably shut down and
reboot before we actually make use of the new grace period time anyway.
So may as well leave it up to the init system to get this right.
This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
change from what they set it to.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the case of v4.0 clients, we may call into the "create" client
tracking operation multiple times (once for each openowner). Upcalling
for each one of those is wasteful and slow however. We can skip doing
further "create" operations after the first one if we know that one has
already been done.
v4.1+ clients generally only call into this function once (on
RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
lift the grace period early since the timestamp has a different meaning
in the case where the client is expected to issue a RECLAIM_COMPLETE.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
which basically results in an upcall every time we call into the client
tracking ops.
Change it to set this bit on a successful "check" or "create" request,
and clear it on a "remove" request. Also, check to see if that bit is
set before upcalling on a "check" or "remove" request, and skip
upcalling appropriately, depending on its state.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
In a later patch, we want to add a flag that will allow us to reduce the
need for upcalls. In order to handle that correctly, we'll need to
ensure that racing upcalls for the same client can't occur. In practice
it should be rare for this to occur with a well-behaved client, but it
is possible.
Convert one of the bits in the cl_flags field to be an upcall bitlock,
and use it to ensure that upcalls for the same client are serialized.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
In order to support lifting the grace period early, we must tell
nfsdcltrack what sort of client the "create" upcall is for. We can't
reliably tell if a v4.0 client has completed reclaiming, so we can only
lift the grace period once all the v4.1+ clients have issued a
RECLAIM_COMPLETE and if there are no v4.0 clients.
Also, in order to lift the grace period, we have to tell userland when
the grace period started so that it can tell whether a RECLAIM_COMPLETE
has been issued for each client since then.
Since this is all optional info, we pass it along in environment
variables to the "init" and "create" upcalls. By doing this, we don't
need to revise the upcall format. The UMH upcall can simply make use of
this info if it happens to be present. If it's not then it can just
avoid lifting the grace period early.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Allow a privileged userland process to end the v4 grace period early.
Writing "Y", "y", or "1" to the file will cause the v4 grace period to
be lifted. The basic idea with this will be to allow the userland
client tracking program to lift the grace period once it knows that no
more clients will be reclaiming state.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Add a new procfile that will allow a (privileged) userland process to
end the NLM grace period early. The basic idea here will be to have
sm-notify write to this file, if it sent out no NOTIFY requests when
it runs. In that situation, we can generally expect that there will be
no reclaim requests so the grace period can be lifted early.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
As stated in RFC 5661, section 18.51.3:
Once a RECLAIM_COMPLETE is done, there can be no further reclaim
operations for locks whose scope is defined as having completed
recovery. Once the client sends RECLAIM_COMPLETE, the server will
not allow the client to do subsequent reclaims of locking state for
that scope and, if these are attempted, will return
NFS4ERR_NO_GRACE.
Ensure that we enforce that requirement.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Currently, all of the grace period handling is part of lockd. Eventually
though we'd like to be able to build v4-only servers, at which point
we'll need to put all of this elsewhere.
Move the code itself into fs/nfs_common and have it build a grace.ko
module. Then, rejigger the Kconfig options so that both nfsd and lockd
enable it automatically.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ocfs2 supports and with
addition of project quotas these two numbers stop matching. So make
ocfs2 use its private definition.
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types reiserfs supports and with
addition of project quotas these two numbers stop matching. So make
reiserfs use its private definition.
CC: reiserfs-devel@vger.kernel.org
CC: Jeff Mahoney <jeffm@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
MAXQUOTAS value defines maximum number of quota types VFS supports. This
isn't necessarily the number of types ext3 supports and with addition of
project quotas these two numbers stop matching. So make ext3 use its
private definition.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Currently write(2) updating i_size and close(2) of the file can race in
such a way that udf_truncate_tail_extent() called from
udf_file_release() sees old i_size but already new extents added by the
running write call. This results in complaints like:
UDF-fs: warning (device vdb2): udf_truncate_tail_extent: Too long extent
after EOF in inode 877: i_size: 0 lbcount: 1073739776 extent 0+1073739776
UDF-fs: error (device vdb2): udf_truncate_tail_extent: Extent after EOF
in inode 877
Fix the problem by grabbing i_mutex in udf_file_release() to be sure
i_size is consistent with current state of extent list. Also avoid
truncating tail extent unnecessarily when the file is still open for
writing.
Signed-off-by: Jan Kara <jack@suse.cz>
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).
Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.
This is an incremental on top of the v4 version of the patch:
"Btrfs: fix fsync data loss after a ranged fsync"
which was added to its v5, but didn't make it on time.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
A previous patch added a ->match_preparse() method to the key type. This is
allowed to override the function called by the iteration algorithm.
Therefore, we can just set a default that simply checks for an exact match of
the key description with the original criterion data and allow match_preparse
to override it as needed.
The key_type::match op is then redundant and can be removed, as can the
user_match() function.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
on large sparse files, a negative dentry hashing fix, a fix for
flock, and a bug fix relating to d_splice_alias usage. There are
also (patches 1 and 5) a couple of updates which are less
critical, but small and low risk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJUGAT8AAoJEMrg3m4a/8jSevIQAJh0sfUAzDZ2p1etZw2OR2bm
0ritvFVkOB4CZ97TS7DvItoqcke+e2ehT8ykIzCNZhJotld7jsitzuQm0ugsMvG2
ltrUil2eqyeBFt1dJk9m366+akbdocVhSpVL/6hLRJI9GTbfWr26jHwUTIQisq5u
Pdr42lLwwAPrj9D7GQRA1jSZLbA36teRGUuV+3CFxAWVgqG0kOmq7iGd27LZZ/hP
m4hjPvdczTfhM3kjNvohicLXQiXhsJdlOZsZKfMTSGmJ2WsMRKoK2PZx2eRmDEEe
ypfkboyfcHCJlbIx+x0qPRheYMC8a2TX3JMb3l6eaV0mZoeExB2kPHeBlfu1jddp
XwLvbs/fbfC96Q6WPKM3JdBDY9DBf5t30Is0VxNJeMH2ERIubaLrOACeD4P7V5mX
PIc3/bcDtNMOirgdd2NGxGR0TxmyhKs7pbtWC+e6Wud04qgsK1JtU8cNro0j8MRE
Rfay0RY+8OjHfDn+shQYl4mh8MF2u3JF3/ThcZjtDfAB0AIhF1Bs5Ihl12B45ov1
7ah0J+8Yk7gwH++KDFNv6XGNzjpjkbBEr9FGLLN3DNDysk4ibEYP3NwEf0Vnme7A
EhYv0IFgUbgdLrU3owA7rRh38A8W23cr78qRSM66/TEGYqBi9+4/DwLE4EIorGYA
mKSpLXhcNtCIrQGNCz3Q
=5NIi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull gfs2 fixes from Steven Whitehouse:
"Here are a number of small fixes for GFS2.
There is a fix for FIEMAP on large sparse files, a negative dentry
hashing fix, a fix for flock, and a bug fix relating to d_splice_alias
usage.
There are also (patches 1 and 5) a couple of updates which are less
critical, but small and low risk"
* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: fix d_splice_alias() misuses
GFS2: Don't use MAXQUOTAS value
GFS2: Hash the negative dentry during inode lookup
GFS2: Request demote when a "try" flock fails
GFS2: Change maxlen variables to size_t
GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
Commit d6bb3e9075 ("vfs: simplify and shrink stack frame of
link_path_walk()") introduced build problems with GCC versions older
than 4.6 due to the initialisation of a member of an anonymous union in
struct qstr without enclosing braces.
This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and
causes the following build error:
fs/namei.c: In function 'link_path_walk':
fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer
This is worked around by adding explicit braces.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206
Fixes: d6bb3e9075 (vfs: simplify and shrink stack frame of link_path_walk())
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the mfsymlinks file size has changed (e.g. the file no longer
represents an emulated symlink) we were not returning an error properly.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
If the inode is same and its data index are needed to truncate, we can fall into
double lock for its inode page via get_dnode_of_data.
Error case is like this.
1. write data 1, 2, 3, 4, 5 in inode #4.
2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
3. sync
4. update data 100->106 in dnode #6.
5. fsync inode #4.
6. power-cut
-> Then,
1. go back to #3's checkpoint
2. in do_recover_data, get_dnode_of_data() gets inode #4.
3. detect 100->106 in dnode #6.
4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
6. detect *kernel hang*
This patch should resolve that bug.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The nm_i->fcnt checking is executed before spin_lock, so if another
thread delete the last free_nid from the list, the wrong nid may be
gotten. So fix the race condition by moving the nm_i->fnct checking
into spin_lock.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved
into next_free_nid of checkpoint, this may cause useless scanning for
next mount. nm_i->next_scan_nid should be a better default value than
0.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user wrote F2FS_IPU_FSYNC:4 in /sys/fs/f2fs/ipu_policy, f2fs_sync_file
only starts to try in-place-updates.
And, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it
keeps out-of-order manner. Otherwise, it triggers in-place-updates.
This may be used by storage showing very high random write performance.
For example, it can be used when,
Seq. writes (Data) + wait + Seq. writes (Node)
is pretty much slower than,
Rand. writes (Data)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously f2fs only counts dirty dentry pages, but there is no reason not to
expand the scope.
This patch changes the names on the management of dirty pages and to count
dirty pages in each inode info as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cifs provides two dummy functions 'sess_auth_lanman' and
'sess_auth_kerberos' for the case in which the respective
features are not defined. However, the caller is also under
an #ifdef, so we just get warnings about unused code:
fs/cifs/sess.c:1109:1: warning: 'sess_auth_kerberos' defined but not used [-Wunused-function]
sess_auth_kerberos(struct sess_data *sess_data)
Removing the dead functions gets rid of the warnings without
any downsides that I can see.
(Yalin Wang reported the identical problem and fix so added him)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This reverts commit 52a3624444.
Causes rmmod to fail for at least 7 seconds after unmount which
makes automated testing a little harder when reloading cifs.ko
between test runs.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
CC: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Commit 9226b5b440 ("vfs: avoid non-forwarding large load after small
store in path lookup") made link_path_walk() always access the
"hash_len" field as a single 64-bit entity, in order to avoid mixed size
accesses to the members.
However, what I didn't notice was that that effectively means that the
whole "struct qstr this" is now basically redundant. We already
explicitly track the "const char *name", and if we just use "u64
hash_len" instead of "long len", there is nothing else left of the
"struct qstr".
We do end up wanting the "struct qstr" if we have a filesystem with a
"d_hash()" function, but that's a rare case, and we might as well then
just squirrell away the name and hash_len at that point.
End result: fewer live variables in the loop, a smaller stack frame, and
better code generation. And we don't need to pass in pointers variables
to helper functions any more, because the return value contains all the
relevant information. So this removes more lines than it adds, and the
source code is clearer too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were not checking for symlink support properly for SMB2/SMB3
mounts so could oops when mounted with mfsymlinks when try
to create symlink when mfsymlinks on smb2/smb3 mounts
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org> # 3.14+
CC: Sachin Prabhu <sprabhu@redhat.com>
Pull vfs fixes from Al Viro:
"double iput() on failure exit in lustre, racy removal of spliced
dentries from ->s_anon in __d_materialise_dentry() plus a bunch of
assorted RCU pathwalk fixes"
The RCU pathwalk fixes end up fixing a couple of cases where we
incorrectly dropped out of RCU walking, due to incorrect initialization
and testing of the sequence locks in some corner cases. Since dropping
out of RCU walk mode forces the slow locked accesses, those corner cases
slowed down quite dramatically.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
be careful with nd->inode in path_init() and follow_dotdot_rcu()
don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
fix bogus read_seqretry() checks introduced in b37199e
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
[fix] lustre: d_make_root() does iput() on dentry allocation failure
The performance regression that Josef Bacik reported in the pathname
lookup (see commit 99d263d4c5 "vfs: fix bad hashing of dentries") made
me look at performance stability of the dcache code, just to verify that
the problem was actually fixed. That turned up a few other problems in
this area.
There are a few cases where we exit RCU lookup mode and go to the slow
serializing case when we shouldn't, Al has fixed those and they'll come
in with the next VFS pull.
But my performance verification also shows that link_path_walk() turns
out to have a very unfortunate 32-bit store of the length and hash of
the name we look up, followed by a 64-bit read of the combined hash_len
field. That screws up the processor store to load forwarding, causing
an unnecessary hickup in this critical routine.
It's caused by the ugly calling convention for the "hash_name()"
function, and easily fixed by just making hash_name() fill in the whole
'struct qstr' rather than passing it a pointer to just the hash value.
With that, the profile for this function looks much smoother.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfstest generic/258 sets the time on a file to a negative value
(before 1970) which fails since do_div can not handle negative
numbers. In addition 'normal' division of 64 bit values does
not build on 32 bit arch so have to workaround this by special
casing negative values in cifs_NTtimeToUnix
Samba server also has a bug with this (see samba bugzilla 7771)
but it works to Windows server.
Signed-off-by: Steve French <smfrench@gmail.com>
in the former we simply check if dentry is still valid after picking
its ->d_inode; in the latter we fetch ->d_inode in the same places
where we fetch dentry and its ->d_seq, under the same checks.
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
return the value instead, and have path_init() do the assignment. Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination. This one should go where
it went.
To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers. And do the same to set_root(), to keep them
in sync.
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bd ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:
"The test case is essentially
for (i = 0; i < 1000000; i++)
mkdir("a$i");
On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
dir/sec with 3.10. This is because we spend waaaaay more time in
__d_lookup on 3.10 than in 3.2.
The new hashing function for strings is suboptimal for <
sizeof(unsigned long) string names (and hell even > sizeof(unsigned
long) string names that I've tested). I broke out the old hashing
function and the new one into a userspace helper to get real numbers
and this is what I'm getting:
Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
My test does the hash, and then does the d_hash into a integer pointer
array the same size as the dentry hash table on my system, and then
just increments the value at the address we got to see how many
entries we overlap with.
As you can see the old hash function ended up with all 1 million
entries in their own bucket, whereas the new one they are only
distributed among ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup".
The reason for this hash regression is two-fold:
- On 64-bit architectures the down-mixing of the original 64-bit
word-at-a-time hash into the final 32-bit hash value is very
simplistic and suboptimal, and just adds the two 32-bit parts
together.
In particular, because there is no bit shuffling and the mixing
boundary is also a byte boundary, similar character patterns in the
low and high word easily end up just canceling each other out.
- the old byte-at-a-time hash mixed each byte into the final hash as it
hashed the path component name, resulting in the low bits of the hash
generally being a good source of hash data. That is not true for the
word-at-a-time case, and the hash data is distributed among all the
bits.
The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible. We already have the
"hash_32|64()" functions to do that.
Reported-by: Josef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Callers of d_splice_alias(dentry, inode) don't need iput(), neither
on success nor on failure. Either the reference to inode is stored
in a previously negative dentry, or it's dropped. In either case
inode reference the caller used to hold is consumed.
__gfs2_lookup() does iput() in case when d_splice_alias() has failed.
Double iput() if we ever hit that. And gfs2_create_inode() ends up
not only with double iput(), but with link count dropped to zero - on
an inode it has just found in directory.
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull btrfs fixes from Chris Mason:
"Filipe is doing a careful pass through fsync problems, and these are
the fixes so far. I'll have one more for rc6 that we're still
testing.
My big commit is fixing up some inode hash races that Al Viro found
(thanks Al)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use insert_inode_locked4 for inode creation
Btrfs: fix fsync data loss after a ranged fsync
Btrfs: kfree()ing ERR_PTRs
Btrfs: fix crash while doing a ranged fsync
Btrfs: fix corruption after write/fsync failure + fsync + log recovery
Btrfs: fix autodefrag with compression
Both blocks layout and objects layout want to use it to avoid CB_LAYOUTRECALL
but that should only happen if client is doing truncation to a smaller size.
For other cases, we let server decide if it wants to recall client's layouts.
Change PNFS_LAYOUTRET_ON_SETATTR to follow the logic and not to send
layoutreturn unnecessarily.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This code is internal to the v3 module, so other parts of the client
shouldn't have any knowledge of it.
nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist
anywhere so I remove the declarations while I'm here.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This check is already performed by the module loading code - if the
module can't be found then -EPROTONOSUPPORT will be returned. Let's
handle v3 this way, too.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I am generally against the "one big header file" approach, and
everything in the client includes this file. Let's move all the NFS v3
declarations into a v3-only header file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This code has been around for a while, but never was enabled, although
it is in a working shape.
Note that we implement NOTIFY_DEVICEID4_CHANGE identical to
NOTIFY_DEVICEID4_DELETE. Given that in either case we can't do anything
but preventing further lookups of a given device ID there isn't much difference
in semantics for the two. For the delete case the server MUST ensure that
there are no outstanding layouts, while for the change case it doesn't, but
that has little relevance to the client.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patches moves parsing of the GETDEVICEINFO XDR to kernel space, as well
as the management of complex devices. The reason for that is we might have
multiple outstanding complex devices after a NOTIFY_DEVICEID4_CHANGE, which
device mapper or md can't handle as they claim devices exclusively.
But as is turns out simple striping / concatenation is fairly trivial to
implement anyway, so we make our life simpler by reducing the reliance
on blkmapd. For now we still use blkmapd by feeding it synthetic SIMPLE
device XDR to translate device signatures to device numbers, but in the
long runs I have plans to eliminate it entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Create a file to house all the rpc_pipefs boilerplate code instead of
sprinkling it over a few files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Factor out a helper for all per-extent work, and merge the now trivial
functions for lseg allocation and parsing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This isn't device(id) related, so move it into the main file. Simple move
for now, the next commit will clean it up a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of overflowing the XDR send buffer with our extent list allocate
pages and pre-encode the layoutupdate payload into them. We optimistically
allocate a single page use alloc_page and only switch to vmalloc when we
have more extents outstanding. Currently there is only a single testcase
(xfstests generic/113) which can reproduce large enough extent lists for
this to occur.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current GETDEVICELIST implementation is buggy in that it doesn't handle
cursors correctly, and in that it returns an error if the server returns
NFSERR_NOTSUPP. Given that there is no actual need for GETDEVICELIST,
it has various issues and might get removed for NFSv4.2 stop using it in
the blocklayout driver, and thus the Linux NFS client as whole.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The kbuild test robot complained about a new sparse warning in
objio_alloc_deviceid_node, but it turns out that this was just a moved
reference to an existing variable. Fix it to have the right big endian
annotated type.
Note that there are some other endianess issues in this file that I didn't
bother to sort out as they involve global headers.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The kbuild test robot complained that we got the printk format wrong.
Let's just kill these printks instead of fixing them as there is not
point after the initial tree algorithm debugging.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
commit 4fa2c54b51
NFS: nfs4_do_open should add negative results to the dcache.
used "d_drop(); d_add();" to ensure that a dentry was hashed
as a negative cached entry.
This is not safe if the dentry has an non-NULL ->d_inode.
It will trigger a BUG_ON in d_instantiate().
In that case, d_delete() is needed.
Also, only d_add if the dentry is currently unhashed, it seems
pointless removed and re-adding it unchanged.
Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 4fa2c54b51
Cc: Jeff Layton <jeff.layton@primarydata.com>
Link: http://lkml.kernel.org/r/20140908144525.GB19811@infradead.org
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/scsi_ioctl.c
This fixes a failure in xfstests generic/313 because nfs doesn't update
mtime on a truncate. The protocol requires this to be done implicity
for a size changing setattr.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types gfs2 supports and with
addition of project quotas these two numbers stop matching. So make gfs2
use its private definition.
CC: cluster-devel@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix a regression introduced by:
6d4ade986f GFS2: Add atomic_open support
where an early return misses d_splice_alias() which had been
adding the negative dentry.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
fsnotify/fdinfo: use named constants instead of hardcoded values
kcmp: fix standard comparison bug
mm/mmap.c: use pr_emerg when printing BUG related information
shm: add memfd.h to UAPI export list
checkpatch: allow commit descriptions on separate line from commit id
sh: get_user_pages_fast() must flush cache
eventpoll: fix uninitialized variable in epoll_ctl
kernel/printk/printk.c: fix faulty logic in the case of recursive printk
mem-hotplug: let memblock skip the hotpluggable memory regions in __next_mem_range()
Currently we handle only ENOSPC. In case of other errors the file_handle
variable isn't filled properly and we will show a part of stack.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAX_HANDLE_SZ is equal to 128, but currently the size of pad is only 64
bytes, so exportfs_encode_inode_fh can return an error.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this unintialized field has EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup. This
produces unexpected messages in the audit log, such as (on a system
running SELinux):
type=AVC msg=audit(1408212798.866:410): avc: denied
{ block_suspend } for pid=7754 comm="dbus-daemon" capability=36
scontext=unconfined_u:unconfined_r:unconfined_t
tcontext=unconfined_u:unconfined_r:unconfined_t
tclass=capability2 permissive=1
type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
exe="/usr/bin/dbus-daemon"
subj=unconfined_u:unconfined_r:unconfined_t key=(null)
("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")
Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
Fixes: 4d7e30d989 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull UDF fixes from Jan Kara:
"Fixes for UDF handling of NFS handles and one fix for proper handling
of corrupted media"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: saner calling conventions for udf_new_inode()
udf: fix the udf_iget() vs. udf_new_inode() races
udf: merge the pieces inserting a new non-directory object into directory
udf: Set i_generation field
udf: Properly detect stale inodes
udf: Make udf_read_inode() and udf_iget() return error
udf: Avoid infinite loop when processing indirect ICBs
udf: Fold udf_fill_inode() into __udf_read_inode()
udf: Avoid dir link count to go negative
sparse says:
fs/nfs/file.c:543:60: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:543:60: expected struct rpc_xprt *xprt
fs/nfs/file.c:543:60: got struct rpc_xprt [noderef] <asn:4>*cl_xprt
fs/nfs/file.c:548:53: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:548:53: expected struct rpc_xprt *xprt
fs/nfs/file.c:548:53: got struct rpc_xprt [noderef] <asn:4>*cl_xprt
cl_xprt is RCU-managed, so we need to take care to dereference and use
it while holding the RCU read lock.
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The VFS never calls setattr with ATTR_SIZE on anything but regular
files. Remove the if check and turn it into an assert similar to
what some other file systems do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This will be used by the block layout driver when splitting extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At a simple helper to issue a GETDEVICELIST operation and pre-load
the device id cache based on the result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add support to the common pNFS core to issue GETDEVICEINFO calls on
a device ID cache miss. The code is taken from the well debugged
file layout implementation and calls out to the layoutdriver through
a new alloc_deviceid_node method. The calling conventions for
nfs4_find_get_deviceid are changed so that all information needed to
send a GETDEVICEINFO request is passed to the common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This speads up truncate-heavy workloads like fsx by multiple orders of
magnitude.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This allows removing extents from the extent tree especially on truncate
operations, and thus fixing reads from truncated and re-extended that
previously returned stale data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently the block layout driver tracks extents in three separate
data structures:
- the two list of pnfs_block_extent structures returned by the server
- the list of sectors that were in invalid state but have been written to
- a list of pnfs_block_short_extent structures for LAYOUTCOMMIT
All of these share the property that they are not only highly inefficient
data structures, but also that operations on them are even more inefficient
than nessecary.
In addition there are various implementation defects like:
- using an int to track sectors, causing corruption for large offsets
- incorrect normalization of page or block granularity ranges
- insufficient error handling
- incorrect synchronization as extents can be modified while they are in
use
This patch replace all three data with a single unified rbtree structure
tracking all extents, as well as their in-memory state, although we still
need to instance for read-only and read-write extent due to the arcane
client side COW feature in the block layouts spec.
To fix the problem of extent possibly being modified while in use we make
sure to return a copy of the extent for use in the write path - the
extent can only be invalidated by a layout recall or return which has
to wait until the I/O operations finished due to refcounts on the layout
segment.
The new extent tree work similar to the schemes used by block based
filesystems like XFS or ext4.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The core nfs code handles setting pages uptodate on reads, no need to mess
with the pageflags outselves. Also remove a debug function to dump page
flags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
handling to core nfs code, and remove a huge chunk of deadlock prone
mess from the block layout writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments. Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Like all block based filesystems, the pNFS block layout driver can't read
or write at a byte granularity and thus has to perform read-modify-write
cycles on writes smaller than this granularity.
Add a flag so that the core NFS code always reads a whole page when
starting a smaller write, so that we can do it in the place where the VFS
expects it instead of doing in very deadlock prone way in the writeback
handler.
Note that in theory we could do less than page size reads here for disks
that have a smaller sector size which are served by a server with a smaller
pnfs block size. But so far that doesn't seem like a worthwhile
optimization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Expedite layout recall processing by forcing a layout commit when
we see busy segments. Without it the layout recall might have to wait
until the VM decided to start writeback for the file, which can introduce
long delays.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
gcc reports:
linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
struct nfs_commit_info cinfo;
Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we do non-page sized reads we can underflow the extent_length variable
and read incorrect data. Fix the extent_length calculation and change to
defensive <= checks for the extent length in the read and write path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Make sure the block queue is plugged when performing pNFS blocklayout I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance
to debug it, especially with the userspace daemon clusterf***k in the block
layout driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well. While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages. Reject this case early on instead
of pretending to support it (badly).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently there is no XDR buffer space allocated for the per-layout driver
layoutcommit payload, which leads to server buffer overflows in the
blocklayout driver even under simple workloads. As we can't do per-layout
sizes for XDR operations we'll have to splice a previously encoded list
of pages into the XDR stream, similar to how we handle ACL buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
After we issued a layoutreturn operations the may free the layout stateid
and will thus cause bad stateid error when the client uses it again.
We currently try to avoid this case by chosing the open stateid if not
lsegs are present for this inode. But various places can hold refererence
on lsegs and thus cause the list not to be empty shortly after a layout
return. Add an explicit flag to mark the current layout stateid invalid
and force usage of the openstateid after we did a full file layoutreturn.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently we fall through to nfs4_async_handle_error when we get
a bad stateid error back from layoutget. nfs4_async_handle_error
with a NULL state argument will never retry the operations but return
the error to higher layer, causing an avoiable fallback to MDS I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When layoutget returns an entirely new layout stateid it should not
check the generation counter as the new stateid will start with a new
counter entirely unrelated to old one.
The current behavior causes constant layoutget failures against a block
server which allocates a new stateid after an recall that removed all
outstanding layouts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ensure the lsegs are initialized early so that we don't pass an unitialized
one back to ->free_lseg during error processing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
pNFS servers may return arbitrarily large layouts. Trim back the I/O size
to one that we can at least allocate the page array for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Following http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
Don't set layoutcommit for commit_through_mds case.
For FILE_SYNC writes, don't set layoutcommit.
For DATA_SYNC wirtes, set layout commit right after wirtes done.
For UNSTABLE writes, set layout commit when commit done.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Track lwb in nfs_commit_data so that we can use it to setup
layoutcommit in commit_done callback.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
can_open_cached() reads values out of the state structure, meaning that
we need the so_lock to have a correct return value. As a bonus, this
helps clear up some potentially confusing code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
filelayout_retry_commit was recently split out from alloc_ds_commits,
but was done in such a way that the bucket pointer always starts at
index 0 no matter what the @idx argument is set to.
The intention of the @idx argument is to retry commits starting at
bucket @idx. This is called when alloc_ds_commits fails for a bucket.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cifs/smb3 fixes from Steve French:
"This includes various cifs and smb3 bug fixes including those for bugs
found with the recently updated xfstests.
Also I am working fixes for two additional cifs problems found by
xfstests which I plan to send later (when reviewed and run additional
tests)"
* 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6:
Clarify Kconfig help text for CIFS and SMB2/SMB3
CIFS: Fix wrong filename length for SMB2
CIFS: Fix wrong restart readdir for SMB1
CIFS: Fix directory rename error
cifs: No need to send SIGKILL to demux_thread during umount
cifs: Allow directIO read/write during cache=strict
cifs: remove unneeded check of null checking in if condition
cifs: fix a possible use of uninit variable in SMB2_sess_setup
cifs: fix memory leak when password is supplied multiple times
cifs: fix a possible null pointer deref in decode_ascii_ssetup
Trivial whitespace fix
If application throws negative value of lseek with SEEK_DATA|SEEK_HOLE,
previous f2fs went into BUG_ON in get_dnode_of_data, which was reported
by Tommi Rantala.
He could make a simple code to detect this having:
lseek(fd, -17595150933902LL, SEEK_DATA);
This patch should resolve that bug.
Reported-by: Tommi Rentala <tt.rantala@gmail.com>
[Jaegeuk Kim: relocate the condition as suggested by Chao]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In gc_node_segment, if node page gc is run concurrently with node page
writeback, and check_valid_map and get_node_page run after page locked
and before cur_valid_map is updated as below, it is possible for the
page to be written twice unnecessarily.
sync_node_pages
try_lock_page
...
check_valid_map f2fs_write_node_page
...
write_node_page
do_write_page
allocate_data_block
...
refresh_sit_entry /* update cur_valid_map */
...
...
unlock_page
get_node_page
...
set_page_dirty
...
f2fs_put_page
unlock_page
This can be solved via calling check_valid_map after get_node_page again.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use flush cmd control to collect many flush cmds, and flush them
together. In this case, we use two list to manage the flush cmds
(collect and dispatch), and one spin lock is used to protect this.
In fact, the lock-less list(llist) is very suitable to this case,
and we use simplify this routine.
-
v2:
-use llist_for_each_entry_safe to fix possible use-after-free issue.
-remove the unused field from struct flush_cmd.
Thanks for Yu's suggestion.
-
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit aec71382c6 ("f2fs: refactor flush_nat_entries codes for reducing NAT
writes"), we descripte the issue as below:
"Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time."
Actually, we have the same problem in using SIT journal area.
In this patch, firstly we will update sit journal with dirty entries as many as
possible. Secondly if there is no space in sit journal, we will remove all
entries in journal and walk through the whole dirty entry bitmap of sit,
accounting dirty sit entries located in same SIT block to sit entry set. All
entry sets are linked to list sit_entry_set in sm_info, sorted ascending order
by count of entries in set. Later we flush entries in set which have fewest
entries into journal as many as we can, and then flush dense set with merged
entries to disk.
In this way we can use sit journal area more effectively, also we will reduce
SIT update, result in gaining in performance and saving lifetime of flash
device.
In my testing environment, it shows this patch can help to reduce SIT block
update obviously.
virtual machine + hard disk:
fsstress -p 20 -n 400 -l 5
sit page num cp count sit pages/cp
based 2006.50 1349.75 1.486
patched 1566.25 1463.25 1.070
Our latency of merging op is small when handling a great number of dirty SIT
entries in flush_sit_entries:
latency(ns) dirty sit count
36038 2151
49168 2123
37174 2232
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO is not used, remove it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch replaces BUG cases with f2fs_bug_on to remain fsck.f2fs information.
And it implements some void functions to initiate fsck.f2fs too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
security_file_set_fowner always returns 0, so make it f_setown and
__f_setown void return functions and fix up the error handling in the
callers.
Cc: linux-security-module@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
GFS2 and NFS have setlease routines that always just return -EINVAL.
Turn that into a generic routine that can live in fs/libfs.c.
Cc: <linux-nfs@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: <cluster-devel@redhat.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
As Kinglong points out, the nlm_block->b_fl field is no longer used at
all. Also, vfs_test_lock in the generic locking code will only return
FILE_LOCK_DEFERRED if FL_SLEEP is set, and it isn't here.
The only other place that returns that value is the DLM lock code, but
it only does that in dlm_posix_lock, never in dlm_posix_get.
Remove all of the deferred locking code from the testlock codepath
since it doesn't appear to ever be used anyway.
I do have a small concern that this might cause a behavior change in the
case where you have a block already sitting on the list when the
testlock request comes in, but that looks like it doesn't really work
properly anyway. I think it's best to just pass that down to
vfs_test_lock and let the filesystem report that instead of trying to
infer what's going on with the lock by looking at an existing block.
Cc: cluster-devel@redhat.com
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
v5: using nfs4_get_stateowner() instead of an inline function
v3: Update based on Jeff's comments
v2: Fix bad using of struct file_lock_operations for handle the owner
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
v5: same as the first version
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Commit d5b9026a67 ([PATCH] knfsd: locks: flag NFSv4-owned locks) using
fl_lmops field in file_lock for checking nfsd4 lockowner.
But, commit 1a747ee0cc (locks: don't call ->copy_lock methods on return
of conflicting locks) causes the fl_lmops of conflock always be NULL.
Also, commit 0996905f93 (lockd: posix_test_lock() should not call
locks_copy_lock()) caused the fl_lmops of conflock always be NULL too.
Make sure copy the private information by fl_copy_lock() in struct
file_lock_operations, merge __locks_copy_lock() to fl_copy_lock().
Jeff advice, "Set fl_lmops on conflocks, but don't set fl_ops.
fl_ops are superfluous, since they are callbacks into the filesystem.
There should be no need to bother the filesystem at all with info
in a conflock. But, lock _ownership_ matters for conflocks and that's
indicated by the fl_lmops. So you really do want to copy the fl_lmops
for conflocks I think."
v5: add missing calling of locks_release_private() in nlmsvc_testlock()
v4: only copy fl_lmops for conflock, don't copy fl_ops
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
NFSD or other lockmanager may increase the owner's reference,
so adds two new options for copying and releasing owner.
v5: change order from 2/6 to 3/6
v4: rename lm_copy_owner/lm_release_owner to lm_get_owner/lm_put_owner
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Jeff advice, " Right now __locks_copy_lock is only used to copy
conflocks. It would be good to rename that to something more
distinct (i.e.locks_copy_conflock), to make it clear that we're
generating a conflock there."
v5: change order from 3/6 to 2/6
v4: new patch only renaming function name
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
This argument is always NULL so don't pass it around.
[jlayton: remove dependencies on previous patches in series]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
The argument to locks_unlink_lock can't be just any pointer to a
pointer. It must be a pointer to the fl_next field in the previous
lock in the list.
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Empty files and missing xattrs do not guarantee that a file was
just created. This patch passes FILE_CREATED flag to IMA to
reliably identify new files.
Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> 3.14+
rbpp is always passed into xfs_rtmodify_summary
and xfs_rtget_summary, so there is no need to
test for it in xfs_rtmodify_summary_int.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_rtmodify_summary and xfs_rtget_summary are almost identical;
fold them into xfs_rtmodify_summary_int(), with wrappers for each of
the original calls.
The _int function modifies if a delta is passed, and returns a
summary pointer if *sum is passed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_dir_canenter and xfs_dir_createname are
almost identical.
Fold the former into the latter, with a helpful
wrapper for the former. If createname is called without
an inode number, it now only checks for space, and does
not actually add the entry.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the resblks test out of the xfs_dir_canenter,
and into the caller.
This makes a little more sense on the face of it;
xfs_dir_canenter immediately returns if resblks !=0;
and given some of the comments preceding the calls:
* Check for ability to enter directory entry, if no space reserved.
even more so.
It also facilitates the next patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In xlog_do_recovery_pass(), there are 2 distinct cases:
non-wrapped and wrapped log recovery.
If we find a wrapped log, we recover around the end
of the log, and then handle the rest of recovery
exactly as in the non-wrapped case - using exactly the same
(duplicated) code.
Rather than having the same code in both cases, we can
get the wrapped portion out of the way first if needed,
and then recover the non-wrapped portion of the log.
There should be no functional change here, just code
reorganization & deduplication.
The patch looks a bit bigger than it really is; the last
hunk is whitespace changes (un-indenting).
Tested with xfstests "check -g log" on a stock configuration.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
For some reason, the older commit:
965c8e5 lseek: the "whence" argument is called "whence"
lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.
Fix most of the sites.
left out xfs. So fix xfs.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_seek_hole & xfs_seek_data are remarkably similar;
so much so that they can be combined, saving a fair
bit of semi-complex code duplication.
The following patch passes generic/285 and generic/286,
which specifically test seek behavior.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
XFS log recovery has been discovered to have race conditions with
buffers when I/O errors occur. External tools are available to simulate
I/O errors to XFS, but this alone is not sufficient for testing log
recovery. XFS unconditionally resets the inactive region of the log
prior to log recovery to avoid confusion over processing any partially
written log records that might have been written before an unclean
shutdown. Therefore, unconditional write I/O failures at mount time are
caught by the reset sequence rather than log recovery and hinder the
ability to test the latter.
The device-mapper dm-flakey module uses an up/down timer to define a
cycle for when to fail I/Os. Create a pre log recovery delay tunable
that can be used to coordinate XFS log recovery with I/O errors
simulated by dm-flakey. This facilitates coordination in userspace that
allows the reset of stale log blocks to succeed and writes due to log
recovery to fail. For example, define a dm-flakey instance with an
uptime long enough to allow log reset to succeed and a log recovery
delay long enough to allow the dm-flakey uptime to expire.
The 'log_recovery_delay' sysfs tunable is exported under
/sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug
mode. The value is exported in units of seconds and allows for a delay
of up to 60 seconds. Note that this is for XFS debug and test
instrumentation purposes only and should not be used by applications. No
delay is enabled by default.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a top-level debug directory for global debug sysfs attributes.
This directory is added and removed on XFS module initialization and
removal respectively for DEBUG mode kernels only. It typically resides
at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
hierarchy as attributes might define global behavior or behavior that
must be configured before an xfs mount is available (e.g., log recovery
behavior).
Define the global debug kobject that represents the debug sysfs
directory and add generic attribute show/store helpers to support future
attributes. No debug attributes are exported as of yet.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These were exposed by fsfuzzer runs; without them we fail
in various exciting and sometimes convoluted ways when we
encounter disk corruption.
Without the MAXLEVELS tests we tend to walk off the end of
an array in a loop like this:
for (i = 0; i < cur->bc_nlevels; i++) {
if (cur->bc_bufs[i])
Without the dirblklog test we try to allocate more memory
than we could possibly hope for and loop forever:
xfs_dabuf_map()
nfsb = mp->m_dir_geo->fsbcount;
irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
As for the logbsize check, that's the convoluted one.
If logbsize is specified at mount time, it's sanitized
in xfs_parseargs; in particular it makes sure that it's
not > XLOG_MAX_RECORD_BSIZE.
If not specified at mount time, it comes from the superblock
via sb_logsunit; this is limited to 256k at mkfs time as well;
it's copied into m_logbsize in xfs_finish_flags().
However, if for some reason the on-disk value is corrupt and
too large, nothing catches it. It's a circuitous path, but
that size eventually finds its way to places that make the kernel
very unhappy, leading to oopses in xlog_pack_data() because we
use the size as an index into iclog->ic_data, but the array
is not necessarily that big.
Anyway - bounds checking when we read from disk is a good thing!
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Workqueues must be explicitly set as freezable to ensure they are frozen
in the assocated part of the hibernation/suspend sequence. Freezing of
workqueues and kernel threads is important to ensure that modifications
are not made on-disk after the hibernation image has been created.
Otherwise, the in-memory state can become inconsistent with what is on
disk and eventually lead to filesystem corruption. We have reports of
free space btree corruptions that occur immediately after restore from
hibernate that suggest the xfs-eofblocks workqueue could be causing
such problems if it races with hibernation.
Mark all of the internal XFS workqueues as freezable to ensure nothing
changes on-disk once the freezer infrastructure freezes kernel threads
and creates the hibernation image.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pull ext4 bugfix from Ted Ts'o.
[ Hmm. It's possible we should make kfree() aware of error pointers,
and use IS_ERR_OR_NULL rather than a NULL check. But in the meantime
this is obviously the right fix. - Linus ]
* 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid trying to kfree an ERR_PTR pointer
Pull nfsd bugfixes from Bruce Fields:
"A couple minor nfsd bugfixes"
* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
lockd: fix rpcbind crash on lockd startup failure
nfsd4: fix rd_dircount enforcement
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk. This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.
This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.
It also makes sure to init the operations pointers on the inode before
going into the error handling paths.
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
While we're doing a full fsync (when the inode has the flag
BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
portion of the file), we might have ordered operations that are started
before or while we're logging the inode and that fall outside the fsync
range.
Therefore when a full ranged fsync finishes don't remove every extent
map from the list of modified extent maps - as for some of them, that
fall outside our fsync range, their respective ordered operation hasn't
finished yet, meaning the corresponding file extent item wasn't inserted
into the fs/subvol tree yet and therefore we didn't log it, and we must
let the next fast fsync (one that checks only the modified list) see this
extent map and log a matching file extent item to the log btree and wait
for its ordered operation to finish (if it's still ongoing).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.
These kind of bugs are "One Err Bugs" where there is just one error
label that does everything. I could set the "inherit = NULL" and keep
the single out label but it ends up being more complicated that way. It
makes the code simpler to re-order the unwind so it's in the mirror
order of the allocation and introduce some new error labels.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 3b29970909 "nfsd4: enforce rd_dircount" totally misunderstood
rd_dircount; it refers to total non-attribute bytes returned, not number
of directory entries returned.
Bring the code into agreement with RFC 3530 section 14.2.24.
Cc: stable@vger.kernel.org
Fixes: 3b29970909 "nfsd4: enforce rd_dircount"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A block_device may be attached to different gendisks and thus
different bdis over time. bdev_inode_switch_bdi() is used to switch
the associated bdi. The function assumes that the inode could be
dirty and transfers it between bdis if so. This is a bit nasty in
that it reaches into bdi internals.
This patch reimplements the function so that it writes out the inode
if dirty. This is a lot simpler and can be implemented without
exposing bdi internals.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev_get_queue() returns the request_queue associated with the
specified block_device. blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.
All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().
Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hu (hujianyang <hujianyang@huawei.com>) discovered an issue in the
'empty_log_bytes()' function, which calculates how many bytes are left in the
log:
"
If 'c->lhead_lnum + 1 == c->ltail_lnum' and 'c->lhead_offs == c->leb_size', 'h'
would equalent to 't' and 'empty_log_bytes()' would return 'c->log_bytes'
instead of 0.
"
At this point it is not clear what would be the consequences of this, and
whether this may lead to any problems, but this patch addresses the issue just
in case.
Cc: stable@vger.kernel.org
Tested-by: hujianyang <hujianyang@huawei.com>
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Hu (hujianyang@huawei.com) discovered a race condition which may lead to a
situation when UBIFS is unable to mount the file-system after an unclean
reboot. The problem is theoretical, though.
In UBIFS, we have the log, which basically a set of LEBs in a certain area. The
log has the tail and the head.
Every time user writes data to the file-system, the UBIFS journal grows, and
the log grows as well, because we append new reference nodes to the head of the
log. So the head moves forward all the time, while the log tail stays at the
same position.
At any time, the UBIFS master node points to the tail of the log. When we mount
the file-system, we scan the log, and we always start from its tail, because
this is where the master node points to. The only occasion when the tail of the
log changes is the commit operation.
The commit operation has 2 phases - "commit start" and "commit end". The former
is relatively short, and does not involve much I/O. During this phase we mostly
just build various in-memory lists of the things which have to be written to
the flash media during "commit end" phase.
During the commit start phase, what we do is we "clean" the log. Indeed, the
commit operation will index all the data in the journal, so the entire journal
"disappears", and therefore the data in the log become unneeded. So we just
move the head of the log to the next LEB, and write the CS node there. This LEB
will be the tail of the new log when the commit operation finishes.
When the "commit start" phase finishes, users may write more data to the
file-system, in parallel with the ongoing "commit end" operation. At this point
the log tail was not changed yet, it is the same as it had been before we
started the commit. The log head keeps moving forward, though.
The commit operation now needs to write the new master node, and the new master
node should point to the new log tail. After this the LEBs between the old log
tail and the new log tail can be unmapped and re-used again.
And here is the possible problem. We do 2 operations: (a) We first update the
log tail position in memory (see 'ubifs_log_end_commit()'). (b) And then we
write the master node (see the big lock of code in 'do_commit()').
But nothing prevents the log head from moving forward between (a) and (b), and
the log head may "wrap" now to the old log tail. And when the "wrap" happens,
the contends of the log tail gets erased. Now a power cut happens and we are in
trouble. We end up with the old master node pointing to the old tail, which was
erased. And replay fails because it expects the master node to point to the
correct log tail at all times.
This patch merges the abovementioned (a) and (b) operations by moving the master
node change code to the 'ubifs_log_end_commit()' function, so that it runs with
the log mutex locked, which will prevent the log from being changed benween
operations (a) and (b).
Cc: stable@vger.kernel.org # 07e19df UBIFS: remove mst_mutex
Cc: stable@vger.kernel.org
Reported-by: hujianyang <hujianyang@huawei.com>
Tested-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.
This patch doesn't make any functional difference.
v2: blk-mq conversion was missing. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks. In some cases, this requires
instrumenting cond_resched(). However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.
This commit currently instruments only RCU-tasks. Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull filesystem fixes from Al Viro:
"Several bugfixes (all of them -stable fodder).
Alexey's one deals with double mutex_lock() in UFS (apparently, nobody
has tried to test "ufs: sb mutex merge + mutex_destroy" on something
like file creation/removal on ufs). Mine deal with two kinds of
umount bugs, in umount propagation and in handling of automounted
submounts, both resulting in bogus transient EBUSY from umount"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix deadlocks introduced by sb mutex merge
fix EBUSY on umount() from MNT_SHRINKABLE
get rid of propagate_umount() mistakenly treating slaves as busy.
Commit 0244756edc ("ufs: sb mutex merge + mutex_destroy") introduces
deadlocks in ufs_new_inode() and ufs_free_inode().
Most callers of that functions acqure the mutex by themselves and
ufs_{new,free}_inode() do that via lock_ufs(),
i.e we have an unavoidable double lock.
The patch proposes to resolve the issue by making sure that
ufs_{new,free}_inode() are not called with the mutex held.
Found by Linux Driver Verification project (linuxtesting.org).
Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix:
- a direct IO read/buffered read data corruption
- the associated fallout from the DIO data corruption fix
- collapse range bugs that are potential data corruption issues.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUCkM+AAoJEK3oKUf0dfodUBgP+gJu50/XV4TFRLPlCRxhvN61
371i3ASls1y7ivhj40NzgbDDAZHM2q8Zqwd//318dFViHhWQDlH/1ga06kscRpZX
d8cQEbFHApgUGQL5Gdq2l2hvAzYa75H0m6cq3jveyrN2rscjCSmAXwtlcEmx3AR6
TnCpxuVL5asjEGZYb0KfQACq//rASHJbhukpo1gB4ccZ0boWHOVf5SxuS4remzs9
y+rlPFNl5RD/WVdnJSvu9zu/nP6op3Ax5r7jZanoKbisKHfd7QOa+k65+Vz0Vq9G
kxgfhz+yLfkOvcktq+41e1lVBln7fCIlcO9m3b53uxWPx5cla323893UiGYsA4F/
j/gGlh1qaT6C/1M1JBWDLDx931S78XiR1Y+WbtAU1PO+GuO0IEap3+iqtS2+oNAv
OrpThLOgqTspK6MJeToCzdn2lRT2BJpcKwxIyDK8g+p9N6qCpyw3DfiKyu0wipGH
D2D3mtE6drSHNaSceFAz8CrQvPOR7Ygj92QGpGSfkohxap9h6VJR/wNp/oGnpmN0
qgcxTrHvx3kw1hXssB4gjh6fBDnOUkac0isqxdow22Qt529t9sIanzMBvz+JxHQF
zeqeFSh96lOXmB7UFBU+QyOhbDp3cJWChHrtY3Esw/+FmG6fxEy8z6pdZsiJYELr
5tka2richPD+gyXzcZwP
=tRQo
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"The fixes all address recently discovered data corruption issues.
The original Direct IO issue was discovered by Chris Mason @ Facebook
on a production workload which mixed buffered reads with direct reads
and writes IO to the same file. The fix for that exposed other issues
with page invalidation (exposed by millions of fsx operations) failing
due to dirty buffers beyond EOF.
Finally, the collapse_range code could also cause problems due to
racing writeback changing the extent map while it was being shifted
around. The commits for that problem are simple mitigation fixes that
prevent the problem from occuring. A more robust fix for 3.18 that
addresses the underlying problem is currently being worked on by
Brian.
Summary of fixes:
- a direct IO read/buffered read data corruption
- the associated fallout from the DIO data corruption fix
- collapse range bugs that are potential data corruption issues"
* tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: trim eofblocks before collapse range
xfs: xfs_file_collapse_range is delalloc challenged
xfs: don't log inode unless extent shift makes extent modifications
xfs: use ranged writeback and invalidation for direct IO
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't dirty buffers beyond EOF
This patch changes sync_filesystem() to be EXPORT_SYMBOL().
The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d64 ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.
As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull aio bugfixes from Ben LaHaise:
"Two small fixes"
* git://git.kvack.org/~bcrl/aio-fixes:
aio: block exit_aio() until all context requests are completed
aio: add missing smp_rmb() in read_events_ring
It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:
nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
inode, set the in-core inode from it
Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
boilerplate code in udf_{create,mknod,symlink} taken to new helper
symlink case converted to unique id calculated by udf_new_inode() - no
point finding a new one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.
Signed-off-by: Jan Kara <jack@suse.cz>
NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes which is a bug. Return -ESTALE if
udf_read_inode() is asked to read deleted inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently __udf_read_inode() wasn't returning anything and we found out
whether we succeeded reading inode by checking whether inode is bad or
not. udf_iget() returned NULL on failure and inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.
Signed-off-by: Jan Kara <jack@suse.cz>
We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.
Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.
Signed-off-by: Jan Kara <jack@suse.cz>
There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.
Signed-off-by: Jan Kara <jack@suse.cz>
If we are writing back inode of unlinked directory, its link count ends
up being (u16)-1. Although the inode is deleted, udf_iget() can load the
inode when NFS uses stale file handle and get confused.
Signed-off-by: Jan Kara <jack@suse.cz>
Pull f2fs bug fixes from Jaegeuk Kim:
"This series includes patches to:
- fix recovery routines
- fix bugs related to inline_data/xattr
- fix when casting the dentry names
- handle EIO or ENOMEM correctly
- fix memory leak
- fix lock coverage"
* tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
f2fs: reposition unlock_new_inode to prevent accessing invalid inode
f2fs: fix wrong casting for dentry name
f2fs: simplify by using a literal
f2fs: truncate stale block for inline_data
f2fs: use macro for code readability
f2fs: introduce need_do_checkpoint for readability
f2fs: fix incorrect calculation with total/free inode num
f2fs: remove rename and use rename2
f2fs: skip if inline_data was converted already
f2fs: remove rewrite_node_page
f2fs: avoid double lock in truncate_blocks
f2fs: prevent checkpoint during roll-forward
f2fs: add WARN_ON in f2fs_bug_on
f2fs: handle EIO not to break fs consistency
f2fs: check s_dirty under cp_mutex
f2fs: unlock_page when node page is redirtied out
f2fs: introduce f2fs_cp_error for readability
f2fs: give a chance to mount again when encountering errors
f2fs: trigger release_dirty_inode in f2fs_put_super
f2fs: don't skip checkpoint if there is no dirty node pages
...
Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)
Fixes: 36de928641
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, lets call it Y, is
smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.
Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.
Some sequence of steps that lead to this:
$ mkfs.btrfs -f /dev/sdd
$ mount -o commit=9999 /dev/sdd /mnt
$ cd /mnt
$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync
$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000
$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().
$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory
# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo
# Our file content (in page cache) is now:
$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.
# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.
$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:
$ btrfs-debug-tree /dev/sdd
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
extent data disk byte 12582912 nr 8192
extent data offset 0 nr 8192 ram 8192
item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 8192 ram 8192
item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
extent data disk byte 12582912 nr 4096
extent data offset 0 nr 4096 ram 4096
$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This fixes an Oopsable race when starting up the callback server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes an Oopsable race when starting lockd.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.
Thanks to Zach for helping to look into this, and suggesting the fix.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
As the race condition on the inode cache, following scenario can appear:
[Thread a] [Thread b]
->f2fs_mkdir
->f2fs_add_link
->__f2fs_add_link
->init_inode_metadata failed here
->gc_thread_func
->f2fs_gc
->do_garbage_collect
->gc_data_segment
->f2fs_iget
->iget_locked
->wait_on_inode
->unlock_new_inode
->move_data_page
->make_bad_inode
->iput
When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.
change log from v1:
o Add condition judgment in gc_data_segment() suggested by Changman Lee.
o use iget_failed to simplify code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.
The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.
As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.
However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....
And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.
The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.
The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.
Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.
Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.
truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.
buffered reads will find an up to date page with zeros instead of
the data actually on disk.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
[dchinner: catch error and warn if it fails. Comment.]
cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.
The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.
cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUAmbRAAoJEAAOaEEZVoIVah4P/iupUO7Ae5ODMDMog/vOp+SM
+sWnyqnEyeMlQlNDoHoef5TPQ28aKEAq1Sg7CsqlK3qZSYSSPhb4KFsGWLZe6D5A
7iWSMKabdnuQ3qBCsb2Y6ZdB8IRAJz81sIAVI8F32NDmSs325wN/coVwfV4g8mQF
QSpv78TjwBY0qNhNw06pS/FLV45IaPTDDgnTHRcOLrfHajDdGTdqrKI/L0ES1PFB
0ZUtG3qMPS2XYRyS6ZQ0TZZrl2/HMA5/fOwqKspNKxYxKS+TOf/umKwPPjHBnMHo
mfD1XnG64ECkNio9bpg2CkjUqaT8aloNPgDxuP15vEV6bZ5WBLKjGUOY2IvPa198
do8CaAdp2Ql6kE2IyD+G+IkjqcxZ9H8hNH3cBM+3TzvxYqaiZKKZAky5UTau2LvG
E5cyWhDPsVBGvAJXEPBf4vhIgzhaSuNox0+73nL4xU+L7bPTDzYIYyhS/InKO0X+
ZwAZn2u62XQmUDI8b+zrgOAHfWB0hHlcIfIsIrDxotM24TPPbJ2k2Dz0hKCZAraR
DYDYPJZg+/QyPc8bujL5Hwjh8MogdDt1vxd9B65MwQWRn791LSGbq6VSsUnsMoAa
dhG5U+a5eI8oQ2gkMEEK45o2ljcnZ3BSim6SGdmZ6YrNyEdk63xA4GjozK1YhWzG
tLkXb4/7zV/dR8VTOQR/
=Wb1t
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux
Pull file locking bugfx from Jeff Layton:
"Just a bugfix for a bug that crept in to v3.15. It's in a rather rare
error path, and I'm not aware of anyone having hit it, but it's worth
fixing for v3.17"
* tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux:
locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge patches from Andrew Morton:
"22 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
kexec: purgatory: add clean-up for purgatory directory
Documentation/kdump/kdump.txt: add ARM description
flush_icache_range: export symbol to fix build errors
tools: selftests: fix build issue with make kselftests target
ocfs2: quorum: add a log for node not fenced
ocfs2: o2net: set tcp user timeout to max value
ocfs2: o2net: don't shutdown connection when idle timeout
ocfs2: do not write error flag to user structure we cannot copy from/to
x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory
drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified
xattr: fix check for simultaneous glibc header inclusion
kexec: remove CONFIG_KEXEC dependency on crypto
kexec: create a new config option CONFIG_KEXEC_FILE for new syscall
x86,mm: fix pte_special versus pte_numa
hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked
mm/zpool: use prefixed module loading
zram: fix incorrect stat with failed_reads
lib: turn CONFIG_STACKTRACE into an actual option.
mm: actually clear pmd_numa before invalidating
memblock, memhotplug: fix wrong type in memblock_find_in_range_node().
...
For debug use, we can see from the log whether the fence decision is
made and why it is not fenced.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time. So we set tcp user
timeout to override the retransmit timeout to the max value. This is OK
for ocfs2 since we have disk heartbeat, if peer crash, the disk
heartbeat will timeout and it will be evicted, if disk heartbeat not
timeout and connection idle for a long time, then this means the cluster
enters split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series is to fix a possible message lost bug in ocfs2 when
network go bad. This bug will cause ocfs2 hung forever even network
become good again.
The messages may lost in this case. After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to
reconnect, so pending messages in tcp queues will be lost. This
messages may be from dlm. Dlm may get hung in this case. This may
cause the whole ocfs2 cluster hung.
This is very possible to happen when network state goes bad. Do the
reconnect is useless, it will fail if network state is still bad. Just
waiting there for network recovering may be a good idea, it will not
lost messages and some node will be fenced until cluster goes into
split-brain state, for this case, Tcp user timeout is used to override
the tcp retransmit timeout. It will timeout after 25 days, user should
have notice this through the provided log and fix the network, if they
don't, ocfs2 will fall back to original reconnect way.
This patch (of 3):
Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout. If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.
To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.
This bug can be saw when network state go bad. It may cause ocfs2 hung
forever if some packets lost. With this fix, ocfs2 will recover from
hung if network becomes good again.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).
In any case, if we cannot copy from/to the structure, why should we
expect putting just the flags to work?
Also make sure ocfs2_info_handle_freeinode() returns the right error
code if the copy_to_user() fails.
Fixes: ddee5cdb70 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights:
- NFSv3 stable fix for another POSIX ACL regression
- NFSv4 stable fix for a regression with OPEN_DOWNGRADE
- NFSv4 stable fix for bad close() behaviour when holding a delegation
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUANWVAAoJEGcL54qWCgDygyQQALF755JgqEjVy+uRjmqoXn/q
4Gc7fGEkSfGbP8BO0dJ9Qs4IX7WQTFNUw1x3xqR+wBrvjCFbQeMclI2XIBwYarBl
7zNnBK9NpmnNh94cataR9WANTmHMm+3xSA3UmK7OnovFOSDviJpKMa3AkRIrJMX5
ZKYvwN2CigcIefYwtQ2NqkDt0CGbt53zYavQ4hp+//LexaN5z0f2krVj8pPwquYZ
3JX1sm+C6bKxyTbAyJ8cWCWnJ/gxDOzl2ZPjtWah4G3tVpO6CF5+07xbQ8+B5KOc
Bm434dWJFlCYSXvmRgbC9i7d7mJU2+fI0rcUP2LDeA73oKDjndsmqmtq08hmPz5K
FfIA7gko4SJXvYzNKyuoS8j5r+LCtEqKoCCwMucVRwy33rpinmlzw68WTsUm8YtK
0qYDeAqeuCc9ZerGMMFfkmgigAd2cWhhUnL+V5tlpCEeFRnL1+jqnRxuBhLlzgN3
SaikZfmncB6gNR6cGwMfceo1E2AoA1GuVy0am1yPsYMhRF6OPCxaLRR53WgioXrt
DwKUqhQtcE0qN1MII44x0Yxl0oFMTTCl279exnjyCWMpGYX/SAI9ErOpsc+QpIxJ
wQEL+xUHOV7B6gTt4Y6GbXtL7toBLcMmT71gz6OHcTNJN0OtMuqtp9jgy1iTG7Gm
2Gd5Su14xza8DFEaDLca
=3xjW
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights:
- NFSv3 stable fix for another POSIX ACL regression
- NFSv4 stable fix for a regression with OPEN_DOWNGRADE
- NFSv4 stable fix for bad close() behaviour when holding a delegation"
* tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv3: Fix another acl regression
NFSv4: Don't clear the open state when we just did an OPEN_DOWNGRADE
NFSv4: Fix problems with close in the presence of a delegation
allocation failures, and to fix some journaling bugs involving journal
checksums and FALLOC_FL_ZERO_RANGE.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJT/+hGAAoJENNvdpvBGATwlU8P/02752nzboRRtqYZBxh/rP6L
QoawhKslb516QFcwgxBpmhf/uSg0XIIakANEFSvlJksj0hcSLNmHl3SjGB6EGyu0
1qOjgXSULFFnLGkjJ9ptzn266irQRR2AX5+mBP1T/JV6L5dRFwylCWbSElxEjobt
WhUe0TzXjazYviItOugh8tQYKrfWlfc0UnMSOU7abastStYkROPuvUUOg0fcQCW/
jZpgFQDKO+TmIZ/QtP26Bogz27Cthe5d1XnA9555JOyYjxpRh3HnVaZXLXOtA2nf
eQZmDpfXCnbqORLsqDQbq1+TMMFVjudQyIgHkmMojshTc2PWGZyl/KtgxDHCoBxz
j3a/qafUPbkqEKTLOunDggkWvOKhah7Z6ZCxzamC3d5Cy2GtjUhhp+iyllf4Tmga
OEWIPp/5F3/UfJj/0e3fcmj8tzTP8bOgVh4xC/Iwf3wugKzeGs9iaWEs02TJpCQk
Yu+xqhHP05MGQuMXcQbPJy+DPq3a43Y/PBlzyF9ZmvJKqs0SxRIhgDnpRXDLla/m
a2zYkzqZBog081idgy1KSJjL1XVBjHkcMwUaZ/mOCd/ok7pAhQIYPVeJEpBylf+l
ABtTgn8qU6+QmkaeTypY8h3OAMES/PgA+PQp46zgmPJVopov0926QuMWzXb2Wbq9
ZFGJziWdAZos5XWnMo6A
=e6Ou
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Ext4 bug fixes for 3.17, to provide better handling of memory
allocation failures, and to fix some journaling bugs involving
journal checksums and FALLOC_FL_ZERO_RANGE"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix same-dir rename when inline data directory overflows
jbd2: fix descriptor block size handling errors with journal_csum
jbd2: fix infinite loop when recovering corrupt journal blocks
ext4: update i_disksize coherently with block allocation on error path
ext4: fix transaction issues for ext4_fallocate and ext_zero_range
ext4: fix incorect journal credits reservation in ext4_zero_range
ext4: move i_size,i_disksize update routines to helper function
ext4: fix BUG_ON in mb_free_blocks()
ext4: propagate errors up to ext4_find_entry()'s callers
The dentry name type is unsigned char *.
If we don't match this type, some character codes can be changed by signed bit.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When performing a same-directory rename, it's possible that adding or
setting the new directory entry will cause the directory to overflow
the inline data area, which causes the directory to be converted to an
extent-based directory. Under this circumstance it is necessary to
re-read the directory when deleting the old dirent because the "old
directory" context still points to i_block in the inode table, which
is now an extent tree root! The delete fails with an FS error, and
the subsequent fsck complains about incorrect link counts and
hardlinked directories.
Test case (originally found with flat_dir_test in the metadata_csum
test program):
# mkfs.ext4 -O inline_data /dev/sda
# mount /dev/sda /mnt
# mkdir /mnt/x
# touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
# sync
# for i in /mnt/x/*; do mv $i $i.longer; done
# ls -la /mnt/x/
total 0
-rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer
(Hey! Why are there four files now??)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
It turns out that there are some serious problems with the on-disk
format of journal checksum v2. The foremost is that the function to
calculate descriptor tag size returns sizes that are too big. This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.
Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.
Add a few function helpers so we don't have to open-code quite so
many pieces.
Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
When recovering the journal, don't fall into an infinite loop if we
encounter a corrupt journal block. Instead, just skip the block and
return an error, which fails the mount and thus forces the user to run
a full filesystem fsck.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
In case of delalloc block i_disksize may be less than i_size. So we
have to update i_disksize each time we allocated and submitted some
blocks beyond i_disksize. We weren't doing this on the error paths,
so fix this.
testcase: xfstest generic/019
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The working group appears committed to keeping the protocol stable, the
code has gotten some use and seems to work OK.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Recent NFS v4.2 drafts have removed NFS4ERR_METADATA_NOTSUPP and
reassigned the error code to NFS4ERR_UNION_NOTSUPP.
I also add in the NFS4ERR_OFFLOAD_NO_REQS error code.
We're not using any of these yet, so there's no harm done.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
locks_alloc_lock() has initialized struct file_lock, no need to
re-initialize it here.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can make the code a bit simpler because we know that "!retry" is
zero.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit f282ac19d8 we use different transactions for
preallocation and i_disksize update which result in complain from fsck
after power-failure. spotted by generic/019. IMHO this is regression
because fs becomes inconsistent, even more 'e2fsck -p' will no longer
works (which drives admins go crazy) Same transaction requirement
applies ctime,mtime updates
testcase: xfstest generic/019
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently we reserve only 4 blocks but in worst case scenario
ext4_zero_partial_blocks() may want to zeroout and convert two
non adjacent blocks.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull btrfs fixes from Chris Mason:
"The biggest of these comes from Liu Bo, who tracked down a hang we've
been hitting since moving to kernel workqueues (it's a btrfs bug, not
in the generic code). His patch needs backporting to 3.16 and 3.15
stable, which I'll send once this is in.
Otherwise these are assorted fixes. Most were integrated last week
during KS, but I wanted to give everyone the chance to test the
result, so I waited for rc2 to come out before sending"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix task hang under heavy compressed write
Btrfs: fix filemap_flush call in btrfs_file_release
Btrfs: fix crash on endio of reading corrupted block
btrfs: fix leak in qgroup_subtree_accounting() error path
btrfs: Use right extent length when inserting overlap extent map.
Btrfs: clone, don't create invalid hole extent map
Btrfs: don't monopolize a core when evicting inode
Btrfs: fix hole detection during file fsync
Btrfs: ensure tmpfile inode is always persisted with link count of 0
Btrfs: race free update of commit root for ro snapshots
Btrfs: fix regression of btrfs device replace
Btrfs: don't consider the missing device when allocating new chunks
Btrfs: Fix wrong device size when we are resizing the device
Btrfs: don't write any data into a readonly device when scrub
Btrfs: Fix the problem that the replace destroys the seed filesystem
btrfs: Return right extent when fiemap gives unaligned offset and len.
Btrfs: fix wrong extent mapping for DirectIO
Btrfs: fix wrong write range for filemap_fdatawrite_range()
Btrfs: fix wrong missing device counter decrease
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
...
The autodefrag code skips defrag when two extents are adjacent. But one
big advantage for autodefrag is cutting down on the number of small
extents, even when they are adjacent. This commit changes it to defrag
all small extents.
Signed-off-by: Chris Mason <clm@fb.com>
I've been seeing issues with disposing cookies under vma pressure. The symptom
is that the refcount gets out of sync. In this case we fail to decrement the
refcount if submit fails. I found this while auditing the error in and around
cookie operations.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
We would have returned -EINVAL earlier if ticks wasn't set.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20140801082848.GF28869@mwanda
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When creating a new object on the NFS server, we should not be sending
posix setacl requests unless the preceding posix_acl_create returned a
non-trivial acl. Doing so, causes Solaris servers in particular to
return an EINVAL.
Fixes: 013cdf1088 (nfs: use generic posix ACL infrastructure,,,)
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we did an OPEN_DOWNGRADE, then the right thing to do on success, is
to apply the new open mode to the struct nfs4_state. Instead, we were
unconditionally clearing the state, making it appear to our state
machinery as if we had just performed a CLOSE.
Fixes: 226056c5c3 (NFSv4: Use correct locking when updating nfs4_state...)
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In the presence of delegations, we can no longer assume that the
state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open
stateid share mode, and so we need to calculate the initial value
for calldata->arg.fmode using the state->flags.
Reported-by: James Drews <drews@engr.wisc.edu>
Fixes: 88069f77e1 (NFSv41: Fix a potential state leakage when...)
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
__this_cpu_ptr is being phased out use raw_cpu_ptr instead which was
introduced in 3.15-rc1.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Highlights:
- More fixes for read/write codepath regressions
- Sleeping while holding the inode lock
- Stricter enforcement of page contiguity when coalescing requests
- Fix up error handling in the page coalescing code
- Don't busy wait on SIGKILL in the file locking code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT+0LpAAoJEGcL54qWCgDyWfsP/imrpge47aZywi95chV8vgjM
O85ITZbupTFwXbB7kE63CrcaxRGhFrSStk4UDhDCDkHfFb1ksjZaPR1mnkwvkR2p
4+JUoq0fkPfeX21+rqKCYmnhstpne/N8K8FJBsEs3/TqiCBWxWOelLXdyWun4H5B
9JBYQ7FYitUazeSiSiDXcl7Di/E09cFPi0H5VPKRyuNdYxySabnsBOELBE/28iXr
egW1I9UKQR2EtBrvgazBbWE5XmB9XAm4X3sD1l0QD65mfSNkbnNhPFSiCdT7f/d6
9uxECR0Y4wNYgYAfVLBew5/MXJajcv03BFMKmTUeGj9fOQzycpBT4Dx2KxEWqfnt
Xk2nNbISxBnO0koMflmo+LPv2lv+Br3kQ+eZCHHKknvBrX2a6bJdTCZkwACVtND9
LdbAveFQpdaeLrm/28TnRoE927r+VeAVM19yOSG8sNAskFFg4Yy51tR0e1GivkJT
+qmmTRx+l78HjHvoPXOYdNgBC954r6APH5ST7su/7WxNClM36fEK6XxA9xbDLJWm
wUzlGKvpwEeBJJhgjbQLwuU8BiksjFz/CaiObNvPOpc/d2GoKIhnTg19kNhg2R//
UCDa2d5fep4z0Bo9p0s1KZm9pSBkkLjvRp9dm8WEIxLcdaF1jBK3dJECepm6ccvw
dmEmEfjbMudVdt/ZhapJ
=2wRt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights:
- more fixes for read/write codepath regressions
* sleeping while holding the inode lock
* stricter enforcement of page contiguity when coalescing requests
* fix up error handling in the page coalescing code
- don't busy wait on SIGKILL in the file locking code"
* tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
nfs: can_coalesce_requests must enforce contiguity
nfs: disallow duplicate pages in pgio page vectors
nfs: don't sleep with inode lock in lock_and_join_requests
nfs: fix error handling in lock_and_join_requests
nfs: use blocking page_group_lock in add_request
nfs: fix nonblocking calls to nfs_page_group_lock
nfs: change nfs_page_group_lock argument
Clarify descriptions of SMB2 and SMB3 support in Kconfig
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
This verifies to truncate any allocated blocks, offset[0], by inline_data.
Not figured out, but for making sure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The existing code uses the old MAX_NAME constant. This causes
XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant
definition.
Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
The existing code calls server->ops->close() that is not
right. This causes XFS test generic/310 to fail. Fix this
by using server->ops->closedir() function.
Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
As reported by Dan Aloni, commit f8567a3845 ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer. Reverting that commit would reintroduce a
regression when user space event reaping is used.
Fixing this bug is a bit more involved than the previous attempts to fix
this regression. Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring. So long as there are as many events in the ring buffer as
there have been completion events generate, we cannot call
put_reqs_available(). The code to check for this is now placed in
refill_reqs_available().
A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Dan Aloni <dan@kernelim.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845 was backported to
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readhead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.
Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).
When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.
With this patch, I no long hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():
BUG_ON(last >= (sb->s_blocksize << 3));
Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
is zero.
Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().
Google-Bug-Id: 16844242
Fixes: 86f0afd463
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If we run into some kind of error, such as ENOMEM, while calling
ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
gets propagated up to ext4_find_entry() and then to its callers. This
way, transient errors such as ENOMEM can get propagated to the VFS.
This is important so that the system calls return the appropriate
error, and also so that in the case of ext4_lookup(), we return an
error instead of a NULL inode, since that will result in a negative
dentry cache entry that will stick around long past the OOM condition
which caused a transient ENOMEM error.
Google-Bug-Id: #17142205
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If a SIGKILL is sent to a task waiting in __nfs_iocounter_wait,
it will busy-wait or soft lockup in its while loop.
nfs_wait_bit_killable won't sleep, and the loop won't exit on
the error return.
Stop the busy-wait by breaking out of the loop when
nfs_wait_bit_killable returns an error.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 6094f83864
"nfs: allow coalescing of subpage requests" got rid of the requirement
that requests cover whole pages, but it made some incorrect assumptions.
It turns out that callers of this interface can map adjacent requests
(by file position as seen by req_offset + req->wb_bytes) to different pages,
even when they could share a page. An example is the direct I/O interface -
iov_iter_get_pages_alloc may return one segment with a partial page filled
and the next segment (which is adjacent in the file position) starts with a
new page.
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Adjacent requests that share the same page are allowed, but should only
use one entry in the page vector. This avoids overruning the page
vector - it is sized based on how many bytes there are, not by
request count.
This fixes issues that manifest as "Redzone overwritten" bugs (the
vector overrun) and hangs waiting on page read / write, as it waits on
the same page more than once.
This also adds bounds checking to the page vector with a graceful failure
(WARN_ON_ONCE and pgio error returned to application).
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This handles the 'nonblock=false' case in nfs_lock_and_join_requests.
If the group is already locked and blocking is allowed, drop the inode lock
and wait for the group lock to be cleared before trying it all again.
This should fix warnings found in peterz's tree (sched/wait branch), where
might_sleep() checks are added to wait.[ch].
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This fixes handling of errors from nfs_page_group_lock in
nfs_lock_and_join_requests. It now releases the inode lock and the
reference to the head request.
Reported-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
__nfs_pageio_add_request was calling nfs_page_group_lock nonblocking, but
this can return -EAGAIN which would end up passing -EIO to the application.
There is no reason not to block in this path, so change the two calls to
do so. Also, there is no need to check the return value of
nfs_page_group_lock when nonblock=false, so remove the error handling code.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_page_group_lock was calling wait_on_bit_lock even when told not to
block. Fix by first trying test_and_set_bit, followed by wait_on_bit_lock
if and only if blocking is allowed. Return -EAGAIN if nonblocking and the
test_and_set of the bit was already locked.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Flip the meaning of the second argument from 'wait' to 'nonblock' to
match related functions. Update all five calls to reflect this change.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch introduces DEF_NIDS_PER_INODE/GET_ORPHAN_BLOCKS/F2FS_CP_PACKS macro
instead of numbers in code for readability.
change log from v1:
o fix typo pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The argument to locks_unlink_lock can't be just any pointer to a
pointer. It must be a pointer to the fl_next field in the previous
lock in the list.
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CIFS servers process nlink counts differently for files and directories.
In cifs_rename() if we the request fails on the existing target, we
try to remove it through cifs_unlink() but this is not what we want
to do for directories. As the result the following sequence of commands
mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar
and XFS test generic/023 fail with -ENOENT error. That's why the second
mkdir reuses the existing inode (target inode of the mv -T command) with
S_DEAD flag.
Fix this by checking whether the target is directory or not and
calling cifs_rmdir() rather than cifs_unlink() for directories.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
There is no need to explicitly send SIGKILL to cifs_demultiplex_thread
as it is calling module_put_and_exit to exit cleanly.
socket sk_rcvtimeo is set to 7 HZ so the thread will wake up in 7 seconds and
clean itself.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently cifs have all or nothing approach for directIO operations.
cache=strict mode does not allow directIO while cache=none mode performs
all the operations as directIO even when user does not specify O_DIRECT
flag. This patch enables strict cache mode to honour directIO semantics.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This patch introduce need_do_checkpoint() to include numerous judgment condition
for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Theoretically, our total inodes number is the same as total node number, but
there are three node ids are reserved in f2fs, they are 0, 1 (node nid), and 2
(meta nid), and they should never be used by user, so our total/free inode
number calculated in ->statfs is wrong.
This patch indroduces F2FS_RESERVED_NODE_NUM and then fixes this issue by
recalculating total/free inode number with the macro.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch checks inline_data one more time under the inode page lock whether
its inline_data is converted or not.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I think we need to let the dirty node pages remain in the page cache instead
of rewriting them in their places.
So, after done with successful recovery, write_checkpoint will flush all of them
through the normal write path.
Through this, we can avoid potential error cases in terms of block allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The init_inode_metadata calls truncate_blocks when error is occurred.
The callers holds f2fs_lock_op, so we should not call it again in
truncate_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Any checkpoint should not be done during the core roll-forward procedure.
Especially, it includes error cases too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two rules when EIO is occurred.
1. don't write any checkpoint data to preserve the previous checkpoint
2. don't lose the cached dentry/node/meta pages
So, at first, this patch adds set_page_dirty in f2fs_write_end_io's failure.
Then, writing checkpoint/dentry/node blocks is not allowed.
Note that, for the data pages, we can't just throw away by redirtying them.
Otherwise, kworker can fall into infinite loop to flush them.
(Ref. xfstests/019)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In case of error, goto ssetup_exit can be hit and we could end up using
uninitialized value of resp_buftype
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Unlikely but possible. When password is supplied multiple times, we have
to free the previous allocation.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
When kzalloc fails, we will end up doing NULL pointer derefrence
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
It needs to check s_dirty under cp_mutex, since s_dirty is reset under that
mutex.
And previous condition was not correct, since we can omit doing checkpoint
when checkpoint was done followed by all the node pages were written back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch gives another chance to try mount process when we encounter an error.
This makes an effect on the roll-forward recovery failures as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The generic_shutdown_super calls sync_filesystem, evict_inode, and then
f2fs_put_super. In f2fs_evict_inode, we remain some dirty inode information
so we should release them at f2fs_put_super.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should only be flushing on close if the file was flagged as needing
it during truncate. I broke this with my ordered data vs transaction
commit deadlock fix.
Thanks to Miao Xie for catching this.
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
The crash is
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
This is in fact a regression.
It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Coverity pointed this out; in the newly added
qgroup_subtree_accounting(), if btrfs_find_all_roots()
returns an error, we leak at least the parents pointer,
and possibly the roots pointer, depending on what failure
occurs.
If btrfs_find_all_roots() returns an error, we need to
free up all allocations before we return. "roots" is
initialized to NULL, so it should be safe to free
it unconditionally (ulist_free() handles that case).
Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.
But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start: 27874 len 4100
28672 27874 28672 27874+4100 32768
|-----------------------|
|---------hole--------------------|---------data----------|
1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).
2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().
3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.
This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.
This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.
Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When cloning a file that consists of an inline extent, we were creating
an extent map that represents a non-existing trailing hole starting at a
file offset that isn't a multiple of the sector size. This happened because
when processing an inline extent we weren't aligning the extent's length to
the sector size, and therefore incorrectly treating the range
[inline_extent_length; sector_size[ as a hole.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.
I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.
$ mkfs.btrfs -f -O no-holes $TEST_DEV
$ MKFS_OPTIONS="-O no-holes" ./check generic/299
generic/299 382s ...
Message from syslogd@debian-vm3 at Aug 7 10:44:29 ...
kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
384s
Ran: generic/299
Passed all 1 tests
$ dmesg
(...)
[86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
(...)
[86304.300036] Call Trace:
[86304.300036] [<ffffffff81698ba9>] __slab_free+0x54/0x295
[86304.300036] [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
[86304.300036] [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
[86304.300036] [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
[86304.300036] [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
[86304.300036] [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
[86304.300036] [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
[86304.300036] [<ffffffff811d8cbb>] evict+0xab/0x180
[86304.300036] [<ffffffff811d8dce>] dispose_list+0x3e/0x60
[86304.300036] [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
[86304.300036] [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
[86304.300036] [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
[86304.300036] [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
[86304.300036] [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
[86304.300036] [<ffffffff811be44e>] deactivate_super+0x4e/0x70
[86304.300036] [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
[86304.300036] [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
[86304.300036] [<ffffffff811e0517>] SyS_umount+0x97/0x100
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The file hole detection logic during a file fsync wasn't correct,
because it didn't look back (in a previous leaf) for the last file
extent item that can be in a leaf to the left of our leaf and that
has a generation lower than the current transaction id. This made it
assume that a hole exists when it really doesn't exist in the file.
Such false positive hole detection happens in the following scenario:
* We have a file that has many file extent items, covering 3 or more
btree leafs (the first leaf must contain non file extent items too).
* Two ranges of the file are modified, with their extent items being
located at 2 different leafs and those leafs aren't consecutive.
* When processing the second modified leaf, we weren't checking if
some file extent item exists that is located in some leaf that is
between our 2 modified leafs, and therefore assumed the range defined
between the last file extent item in the first leaf and the first file
extent item in the second leaf matched a hole.
Fortunately this didn't result in overriding the log with wrong data,
instead it made the last loop in copy_items() attempt to insert a
duplicated key (for a hole file extent item), which makes the file
fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
turn ends up doing a full transaction commit, which is much more
expensive then writing only to the log tree and wait for it to be
durably persisted (as well as the file's modified extents/pages).
Therefore fix the hole detection logic, so that we don't pay the
cost of doing full transaction commits.
I could trigger this issue with the following test for xfstests (which
never fails, either without or with this patch). The last fsync call
results in a full transaction commit, due to the -EEXIST error mentioned
above. I could also observe this behaviour happening frequently when
running xfstests/generic/075 in a loop.
Test:
_cleanup()
{
_cleanup_flakey
rm -fr $tmp
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_flakey
_need_to_be_root
rm -f $seqres.full
# Create a file with many file extent items, each representing a 4Kb extent.
# These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
# as of btrfs-progs 3.12).
_scratch_mkfs -l 16384 >/dev/null 2>&1
_init_flakey
SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
_mount_flakey
# First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
$XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# For any of the following fsync calls, inode doesn't have the flag
# BTRFS_INODE_NEEDS_FULL_SYNC set.
for ((i = 1; i <= 500; i++)); do
OFFSET=$((4096 * i))
LEN=4096
$XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
done
# Commit transaction and bump next transaction's id (to 7).
sync
# Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
# inode runtime flags.
$XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo
# Commit transaction and bump next transaction's id (to 8).
sync
# Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
# in the middle, containing only file extent items, isn't touched. So the
# next fsync, when calling btrfs_search_forward(), won't visit that middle
# leaf. First and 3rd leaf have now a generation with value 8, while the
# middle leaf remains with a generation with value 6.
$XFS_IO_PROG \
-c "pwrite -S 0xee -b 4096 0 4096" \
-c "pwrite -S 0xff -b 4096 2043904 4096" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
_load_flakey_table $FLAKEY_DROP_WRITES
md5sum $SCRATCH_MNT/foo | _filter_scratch
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
# During mount, we'll replay the log created by the fsync above, and the file's
# md5 digest should be the same we got before the unmount.
_mount_flakey
md5sum $SCRATCH_MNT/foo | _filter_scratch
_unmount_flakey
MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.
Steps to reproduce it (requires a modern xfs_io with -T support):
$ mkfs.btrfs -f /dev/sdd
$ mount -o /dev/sdd /mnt
$ xfs_io -T /mnt &
$ sync
Then btrfs-debug-tree shows the inode item with a link count of 1:
$ btrfs-debug-tree /dev/sdd
(...)
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 29556736 items 4 free space 15851 generation 6 owner 5
fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
inode ref index 0 namelen 2 name: ..
item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
orphan item
checksum tree key (CSUM_TREE ROOT_ITEM 0)
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is a better solution for the problem addressed in the following
commit:
Btrfs: update commit root on snapshot creation after orphan cleanup
(3821f34888)
The previous solution wasn't the best because of 2 reasons:
1) It added another full transaction commit, which is more expensive
than just swapping the commit root with the root;
2) If a reboot happened after the first transaction commit (the one
that creates the snapshot) and before the second transaction commit,
then we would end up with the same problem if a send using that
snapshot was requested before the first transaction commit after
the reboot.
This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.
Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 49c6f736f34f901117c20960ebd7d5e60f12fcac(
btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry
in the process of device replace, but didn't take missing devices into account,
so now we have
BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
...
To reproduce it,
1. mkfs.btrfs -f disk1 disk2
2. mkfs.ext4 disk1
3. mount disk2 /mnt -odegraded
4. btrfs replace start -B 1 disk3 /mnt
--------------------------
This fixes the problem.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch changes the flock code so that it uses the TRY_1CB flag
instead of the TRY flag on the first attempt. That forces any holding
nodes to issue a dlm callback, which requests a demote of the glock.
Then, if the "try" failed, it sleeps a small amount of time for the
demote to occur. Then it tries again, for an increasing amount of time.
Subsequent attempts to gain the "try" lock don't use "_1CB" so that
only one callback is issued.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes some variables (especially maxlen in function
gfs2_block_map) from unsigned int to size_t. We need 64-bit arithmetic
for very large files (e.g. 1PB) where the variables otherwise get
shifted to all 0's.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull cifs fixes from Steve French:
"Most important fixes in this set include three SMB3 fixes for stable
(including fix for possible kernel oops), and a workaround to allow
writes to Mac servers (only cifs dialect, not more current SMB2.1,
worked to Mac servers). Also fallocate support added, and lease fix
from Jeff"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
[SMB3] Enable fallocate -z support for SMB3 mounts
enable fallocate punch hole ("fallocate -p") for SMB3
Incorrect error returned on setting file compressed on SMB2
CIFS: Fix wrong directory attributes after rename
CIFS: Fix SMB2 readdir error handling
[CIFS] Possible null ptr deref in SMB2_tcon
[CIFS] Workaround MacOS server problem with SMB2.1 write response
cifs: handle lease F_UNLCK requests properly
Cleanup sparse file support by creating worker function for it
Add sparse file support to SMB2/SMB3 mounts
Add missing definitions for CIFS File System Attributes
cifs: remove unused function cifs_oplock_break_wait
The journal blocks of external journal device should not
be counted as overhead.
Signed-off-by: Chin-Tsung Cheng <chintzung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This is the errorneous scenario.
1. write data
2. do checkpoint
3. produce some dirty node pages by the gc thread
4. write back dirty node pages
5. f2fs_put_super will skip the checkpoint, since dirty count for node pages is
zero.
This patch removes such the wrong condition check.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes not to skip xattr recovery and inline xattr/data recovery
order.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During the recovery, we should clear the inline_xattr flag if its xattr node
block is recovered.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode are fsynced multiple times with fsync & dent marks, this inode will
set FI_INC_LINK at find_fsync_dnodes during the recovery.
But, in recover_inode, recover_dentry doesn't clear that flag when multiple hits
were occurred.
So this patch removes the flag for the further consistency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a new inode page is needed for recover_dentry, we should assing i_inline
as zero.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a parentheses to make clear for condition check.
And also it changes the return type for better meanings.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If mkwrite is called to an inode having inline_data, it can overwrite the data
index space as NEW_ADDR. (e.g., the first 4 bytes are coincidently zero)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix typo and some grammatical errors.
The words "filesystem" and "readahead" are being used without the space treewide.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We did not check relocated directory in any way when processing Rock
Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL
entry pointing to another CL entry leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).
Fix the problem by not allowing CL entry to point to a directory entry
with CL entry (such use makes no good sense anyway) and by checking
whether CL entry doesn't point to itself.
CC: stable@vger.kernel.org
Reported-by: Chris Evans <cevans@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We have released the ->i_data_sem before invoking udf_add_entry(),
so in following error path, we should not release this lock again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The original code allocated new chunks by the number of the writable devices
and missing devices to make sure that any RAID levels on a degraded FS continue
to be honored, but it introduced a problem that it stopped us to allocating
new chunks, the steps to reproduce is following:
# mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
# mkfs.btrfs -f <dev1> //Removing <dev1> from the original fs
# mount -o degraded <dev0> <mnt>
# dd if=/dev/null of=<mnt>/tmpfile bs=1M
It is because we allocate new chunks only on the writable devices, if we take
the number of missing devices into account, and want to allocate new chunks
with higher RAID level, we will fail becaue we don't have enough writable
device. Fix it by ignoring the number of missing devices when allocating
new chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The seed filesystem was destroyed by the device replace, the reproduce
method is:
# mkfs.btrfs -f <dev0>
# btrfstune -S 1 <dev0>
# mount <dev0> <mnt>
# btrfs device add <dev1> <mnt>
# umount <mnt>
# mount <dev1> <mnt>
# btrfs replace start -f <dev0> <dev2> <mnt>
# umount <mnt>
# mount <dev0> <mnt>
It is because we erase the super block on the seed device. It is wrong,
we should not change anything on the seed device.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.
The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.
This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.
However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.
Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The missing devices are accounted by its own fs device, for example
the missing devices in seed filesystem will be accounted by the fs device
of the seed filesystem, not by the new filesystem which is based on
the seed filesystem, so when we remove the missing device in the
seed filesystem, we should decrease the counter of its own fs device.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.
device_list_add() function is called in the following threads,
mount device option
mount argument
ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
they have been unit tested to work fine with this patch.
If the user knows what he is doing and really want to pair with
replaced disk (which is not a standard operation), then he should
first clear the kernel btrfs device list in the memory by doing
the module unload/load and followed with the mount -o device option.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.
Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.
In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.
Which is to note that old device is neither closed nor gracefully
removed from the btrfs.
The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.
reproducer:
devmgt[1] detach /dev/sdc
replace the missing disk /dev/sdc
btrfs rep start -f 1 /dev/sde /btrfs
Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
Total devices 2 FS bytes used 32.00KiB
devid 1 size 958.94MiB used 115.88MiB path /dev/sde
devid 2 size 958.94MiB used 103.88MiB path /dev/sdd
make /dev/sdc to reappear
devmgt attach host2
btrfs dev scan
btrfs fi show -m
Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
Total devices 2 FS bytes used 32.00KiB^M
devid 1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
devid 2 size 958.94MiB used 103.88MiB path /dev/sdd
since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.
[1] github.com/anajain/devmgt.git
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot
where the key could have been present, which in this case would be the slot
containing the extent item which would be the next neighbor of the file range
being punched. The current code passes an incremented path->slots[0] and we
skip to the wrong file extent item. This would mean that we would fail to
merge the "yet to be created" hole with the next neighboring hole (if one
exists). Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
One of our customer's application only needs file names, not file
attributes. With directories having 10K+ inodes (assuming buffer cache
has directory blocks cached having file names, but inode cache is
limited and hence need eviction of older cached inodes), older inodes
are evicted periodically. So if they keep on doing readdir(2) from NSF
client on multiple directories, some directory's files are periodically
removed from inode cache and hence new readdir(2) on same directory
requires disk access to bring back inodes again to inode cache.
As READDIRPLUS request fetches attributes also, doing getattr on each
file on server, it causes unnecessary disk accesses. If READDIRPLUS on
NFS client is returned with -ENOTSUPP, NFS client uses READDIR request
which just gets the names of the files in a directory, not attributes,
hence avoiding disk accesses on server.
There's already a corresponding client-side mount option, but an export
option reduces the need for configuration across multiple clients.
This flag affects NFSv3 only. If it turns out it's needed for NFSv4 as
well then we may have to figure out how to extend the behavior to NFSv4,
but it's not currently obvious how to do that.
Signed-off-by: Rajesh Ghanekar <rajesh_ghanekar@symantec.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
fallocate -z (FALLOC_FL_ZERO_RANGE) can map to SMB3
FSCTL_SET_ZERO_DATA SMB3 FSCTL but FALLOC_FL_ZERO_RANGE
when called without the FALLOC_FL_KEEPSIZE flag set could want
the file size changed so we can not support that subcase unless
the file is cached (and thus we know the file size).
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
Implement FALLOC_FL_PUNCH_HOLE (which does not change the file size
fortunately so this matches the behavior of the equivalent SMB3
fsctl call) for SMB3 mounts. This allows "fallocate -p" to work.
It requires that the server support setting files as sparse
(which Windows allows).
Signed-off-by: Steve French <smfrench@gmail.com>
When the server (for an SMB2 or SMB3 mount) doesn't support
an ioctl (such as setting the compressed flag
on a file) we were incorrectly returning EIO instead
of EOPNOTSUPP, this is confusing e.g. doing chattr +c to a file
on a non-btrfs Samba partition, now the error returned is more
intuitive to the user. Also fixes error mapping on setting
hardlink to servers which don't support that.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
As of 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low
on space", we permit the server to process a LOCK operation even if
there might not be space to return the conflicting lockowner, because
we've made returning the conflicting lockowner optional.
However, the rpc server still wants to know the most we might possibly
return, so we need to take into account the possible conflicting
lockowner in the svc_reserve_space() call here.
Symptoms were log messages like "RPC request reserved 88 but used 108".
Fixes: 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When creating a file that already exists in a read-only directory with
O_EXCL, the NFSv3 server returns EACCES rather than EEXIST (which local
files and the NFSv4 server return). Fix this by checking the MAY_CREATE
permission only if the file does not exist. Since this already happens
in do_nfsd_create, the check in nfsd3_proc_create can simply be removed.
Signed-off-by: Ross Lagerwall <rosslagerwall@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, we hold the state_lock when releasing the lease. That's
potentially problematic in the future if we allow for setlease methods
that can sleep. Move the nfs4_put_deleg_lease call out of the delegation
unhashing routine (which was always a bit goofy anyway), and into the
unlocked sections of the callers of unhash_delegation_locked.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently these fields are protected with the state_lock, but that
doesn't really make a lot of sense. These fields are "private" to the
nfs4_file, and can be protected with the more granular fi_lock.
The fi_lock is already held when setting these fields. Make the code
hold the fp->fi_lock when clearing the lease-related fields in the
nfs4_file, and no longer require that the state_lock be held when
calling into this function.
To prevent lock inversion with the i_lock, we also move the vfs_setlease
and fput calls outside of the fi_lock. This also sets us up for allowing
vfs_setlease calls to block in the future.
Finally, remove a redundant NULL pointer check. unhash_delegation_locked
locks the fp->fi_lock prior to that check, so fp in that function must
never be NULL.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We would normally expect the xid and the checksum to be the best
discriminators. Check them before looking at the procedure number,
etc.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...so we can remove the spinlocking around it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that the lru list is per-bucket, we don't need a second list for
searches.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we requests rename we also need to update attributes
of both source and target parent directories. Not doing it
causes generic/309 xfstest to fail on SMB2 mounts. Fix this
by marking these directories for force revalidating.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB2 servers indicates the end of a directory search with
STATUS_NO_MORE_FILE error code that is not processed now.
This causes generic/257 xfstest to fail. Fix this by triggering
the end of search by this error code in SMB2_query_directory.
Also when negotiating CIFS protocol we tell the server to close
the search automatically at the end and there is no need to do
it itself. In the case of SMB2 protocol, we need to close it
explicitly - separate close directory checks for different
protocols.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
and there is one path in which tcon can be null.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org> # v3.7+
Reported-by: Raphael Geissert <geissert@debian.org>
Pull btrfs updates from Chris Mason:
"These are all fixes I'd like to get out to a broader audience.
The biggest of the bunch is Mark's quota fix, which is also in the
SUSE kernel, and makes our subvolume quotas dramatically more
accurate.
I've been running xfstests with these against your current git
overnight, but I'm queueing up longer tests as well"
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: disable strict file flushes for renames and truncates
Btrfs: fix csum tree corruption, duplicate and outdated checksums
Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
Btrfs: fix compressed write corruption on enospc
btrfs: correctly handle return from ulist_add
btrfs: qgroup: account shared subtrees during snapshot delete
Btrfs: read lock extent buffer while walking backrefs
Btrfs: __btrfs_mod_ref should always use no_quota
btrfs: adjust statfs calculations according to raid profiles
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT7ey6AAoJEAAOaEEZVoIVgdkQAJfToVhgddVMOweeGo4wqPqM
lmS35dYVEy+gPfYhCU2Zytgk9yLlmNJeDT7Y+XGe4l15Ax/PDNuWLiRnUKRy1FrU
l7cbKRAn1L6TqBO2ang5t68Bw7ojUpRoKWKnDyjVAj80mZYRZPWvQOurLeTtra2o
XLnHbK54F36s07OXwSTZgbh/ffVQ1RMUBU8fy+0Ws3mTAbzO1KuB5Ws4pdt2ysjI
14pBHO553X0VXAJxGkH66xxblt1o+Q9aBZp7RL0VNtR4bGU4FMvXy+0D5G6h8AGt
rhl2PWTNKGg8HFgUK+7nCheH4j/0maXo541D3q+sJhbknRhD3n3x7IBvjkm9tdjB
OgTkAp1mwL21mJP21MOrAil8uwGoSr7qTCngZY6nNWH4L3EkVl27+LlGDtkgBp/n
BJyhcPbFSh+K4TTD3dg2rEx7wF/npQ6yPOljjvWXKJEm5lx3ZPkK1l1xS3QnVTMe
pLq+wTZ9v1cU7+9JFWICQwclv4unjIgqxLo7/op5P5KvTWOHFW4cjdwCBPVE1g3a
WC2c5jdB0Up6Z59aXAN3p5dk8MCy6NA41lkMfJN4AUAIzb6NvNeDBhrcFaHwXowm
jgCtEPqFN/HqsEwJmhJ7fcIEKYrOCuzeaR5tGuwJ11re/oULGXTgMkrFwgm/YWwu
esRBAc53/hZg6oo3ipUH
=09BX
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux
Pull file locking bugfixes from Jeff Layton:
"Most of these patches are to fix a long-standing regression that crept
in when the BKL was removed from the file-locking code. The code was
converted to use a conventional spinlock, but some fl_release_private
ops can block and you can end up sleeping inside the lock.
There's also a patch to make /proc/locks show delegations as 'DELEG'"
* tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux:
locks: update Locking documentation to clarify fl_release_private behavior
locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
locks: don't reuse file_lock in __posix_lock_file
locks: don't call locks_release_private from locks_copy_lock
locks: show delegations as "DELEG" in /proc/locks
Pull aio updates from Ben LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: use iovec array rather than the single one
aio: fix some comments
aio: use the macro rather than the inline magic number
aio: remove the needless registration of ring file's private_data
aio: remove no longer needed preempt_disable()
aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
response
Writes fail to Mac servers with SMB2.1 mounts (works with cifs though) due
to them sending an incorrect RFC1001 length for the SMB2.1 Write response.
Workaround this problem. MacOS server sends a write response with 3 bytes
of pad beyond the end of the SMB itself. The RFC1001 length is 3 bytes
more than the sum of the SMB2.1 header length + the write reponse.
Incorporate feedback from Jeff and JRA to allow servers to send
a tcp frame that is even more than three bytes too long
(ie much longer than the SMB2/SMB3 request that it contains) but
we do log it once now. In the earlier version of the patch I had
limited how far off the length field could be before we fail the request.
Signed-off-by: Steve French <smfrench@gmail.com>
Currently any F_UNLCK request for a lease just gets back -EAGAIN. Allow
them to go immediately to generic_setlease instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Simply move code to new function (for clarity). Function sets or clears
the sparse file attribute flag.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
Truncates and renames are often used to replace old versions of a file
with new versions. Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.
Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.
This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held. It's rare, but these things can
deadlock.
This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.
Signed-off-by: Chris Mason <clm@fb.com>
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.
The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.
For example, consider the following scenario, where we have:
sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
Leaf N:
slot = 0 slot = btrfs_header_nritems() - 1
|-------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
|-------------------------------------------------------------------|
Leaf N + 1:
slot = 0 slot = btrfs_header_nritems() - 1
|--------------------------------------------------------------------|
| [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
|--------------------------------------------------------------------|
Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:
slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
|----------------------------------------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] |
|----------------------------------------------------------------------------------------------------|
And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:
tmp = min((sums->len - total_bytes) >> blocksize_bits,
(next_offset - file_key.offset) >> blocksize_bits) =
min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
and
ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.
So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:
Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
*pde = 00000000
Oops: 0000 [#1] SMP
CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1
Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
task: f1478130 ti: f147c000 task.ti: f147c000
EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
Stack:
00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
Call Trace:
[<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
[<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
[<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
[<c025e38b>] process_one_work+0x11b/0x390
[<c025eea1>] worker_thread+0x101/0x340
[<c026432b>] kthread+0x9b/0xb0
[<c0712a71>] ret_from_kernel_thread+0x21/0x30
[<c0264290>] kthread_create_on_node+0x110/0x110
This indicates a NULL corruption in prefs_delayed list. The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.
ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux. The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux. That is, the function overwrites
64bit value on 32bit pointer. This caused a NULL in the adjacent
variable, in this case, prefs_delayed.
Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64. There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly. But, it's
safer than explicit cast to u64, anyway.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Cc: <stable@vger.kernel.org> [v3.11+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
doesn't take into account. As a result, that value can be bubbled up to
callers, causing an error to be printed. Fix this by only returning the
value of ulist_add() when it indicates an error.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.
In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.
This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there. With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
Works fine for "clean" raid profiles where the raid factor correction
does the right job. Otherwise it's pessimistic and may show low space
although there's still some left.
The df nubmers are lightly wrong in case of mixed block groups, but this
is not a major usecase and can be addressed later.
The RAID56 numbers are wrong almost the same way as before and will be
addressed separately.
CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
There's no need to call locks_free_lock here while still holding the
i_lock. Defer that until the lock has been dropped.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
In commit 72f98e7255 (locks: turn lock_flocks into a spinlock), we
moved from using the BKL to a global spinlock. With this change, we lost
the ability to block in the fl_release_private operation.
This is problematic for NFS (and probably some other filesystems as
well). Add a new list_head argument to locks_delete_lock. If that
argument is non-NULL, then queue any locks that we want to free to the
list instead of freeing them.
Then, add a new locks_dispose_list function that will walk such a list
and call locks_free_lock on them after the i_lock has been dropped.
Finally, change all of the callers of locks_delete_lock to pass in a
list_head, except for lease_modify. That function can be called long
after the i_lock has been acquired. Deferring the freeing of a lease
after unlocking it in that function is non-trivial until we overhaul
some of the spinlocking in the lease code.
Currently though, no filesystem that sets fl_release_private supports
leases, so this is not currently a problem. We'll eventually want to
make the same change in the lease code, but it needs a lot more work
before we can reasonably do so.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Currently in the case where a new file lock completely replaces the old
one, we end up overwriting the existing lock with the new info. This
means that we have to call fl_release_private inside i_lock. Change the
code to instead copy the info to new_fl, insert that lock into the
correct spot and then delete the old lock. In a later patch, we'll defer
the freeing of the old lock until after the i_lock has been dropped.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Highlights include:
- Stable fix for a bug in nfs3_list_one_acl()
- Speed up NFS path walks by supporting LOOKUP_RCU
- More read/write code cleanups
- pNFS fixes for layout return on close
- Fixes for the RCU handling in the rpcsec_gss code
- More NFS/RDMA fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
ivr2s7aIDV48jhnwBSVL
=KE07
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- stable fix for a bug in nfs3_list_one_acl()
- speed up NFS path walks by supporting LOOKUP_RCU
- more read/write code cleanups
- pNFS fixes for layout return on close
- fixes for the RCU handling in the rpcsec_gss code
- more NFS/RDMA fixes"
* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
nfs: reject changes to resvport and sharecache during remount
NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
NFS: fix two problems in lookup_revalidate in RCU-walk
NFS: allow lockless access to access_cache
NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
NFS: support RCU_WALK in nfs_permission()
sunrpc/auth: allow lockless (rcu) lookup of credential cache.
NFS: prepare for RCU-walk support but pushing tests later in code.
NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
NFS: add checks for returned value of try_module_get()
nfs: clear_request_commit while holding i_lock
pnfs: add pnfs_put_lseg_async
pnfs: find swapped pages on pnfs commit lists too
nfs: fix comment and add warn_on for PG_INODE_REF
nfs: check wait_on_bit_lock err in page_group_lock
sunrpc: remove "ec" argument from encrypt_v2 operation
sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
...
This update contains:
o conversion of the XFS core to pass negative error numbers
o restructing of core XFS code that is shared with userspace to fs/xfs/libxfs
o introduction of sysfs interface for XFS
o bulkstat refactoring
o demand driven speculative preallocation removal
o XFS now always requires 64 bit sectors to be configured
o metadata verifier changes to ensure CRCs are calculated during log recovery
o various minor code cleanups
o miscellaneous bug fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJT6dJvAAoJEK3oKUf0dfodYOYP/jhWGv29OrX958WC+U1kCItb
WmNV6dSENWgtgYRgdtcxa5ZE0HeEndvN63LOCzAB9VegZK7XjpLob9J2f07gMvP1
u4vv9LvgLdky/tYxttBYHuyen0hQO0r+esRg/xrawpklKQ3Efofi4MUSbb5ZBDzP
QvVeNhIDI04A0CoDngWDkV1PpbdwDUjVZFZon36XVDTcSCf/h2B+nekjOVJQpEDC
qJ9nZWRgAcm4IzZZBqwGt0LJ62Z+Ww8jksevWEnjcXm1xfsltL8gf8o6WOX5iXDR
PSeoJwVRAxLfiQxnQSENEOeLvKWVXNyC/B10w1H20qZkvQJMsX/Fq0MraBL0Ediu
0qyZgJsOyoOAvJYXWfUyykUe03dF5/cgrB+RpOLmp4B9lCugCEsQHtjRXEHE1EK2
+4sHc13I7+SLIAlgySmJdrLe8Kq1vamo0/WD2wH+pdOvNmQ7HJzBUFEj1D83b55a
DWfBjbz/N5lJW82Vfek2coURJGi/1B87koAtODXWfeM+nbBs47z0dv31G235Boq3
+Qy2qQdCv05xV+CZn/RAmaudAFFnJEV8L4ADpwqSGfVM72ug0KQF8M5DblG1Q8sb
xRAQq+krG7K4ui1kPNKcxoHoUIYQUJm0p1aU4V2OAQlikYCCaSgRYGF2xzSVC7j9
uZd4e4bllsBdPF/Se+8W
=EUjq
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Dave Chinner:
"This update contains:
- conversion of the XFS core to pass negative error numbers
- restructing of core XFS code that is shared with userspace to
fs/xfs/libxfs
- introduction of sysfs interface for XFS
- bulkstat refactoring
- demand driven speculative preallocation removal
- XFS now always requires 64 bit sectors to be configured
- metadata verifier changes to ensure CRCs are calculated during log
recovery
- various minor code cleanups
- miscellaneous bug fixes
The diffstat is kind of noisy because of the restructuring of the code
to make kernel/userspace code sharing simpler, along with the XFS wide
change to use the standard negative error return convention (at last!)"
* tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs: (45 commits)
xfs: fix coccinelle warnings
xfs: flush both inodes in xfs_swap_extents
xfs: fix swapext ilock deadlock
xfs: kill xfs_vnode.h
xfs: kill VN_MAPPED
xfs: kill VN_CACHED
xfs: kill VN_DIRTY()
xfs: dquot recovery needs verifiers
xfs: quotacheck leaves dquot buffers without verifiers
xfs: ensure verifiers are attached to recovered buffers
xfs: catch buffers written without verifiers attached
xfs: avoid false quotacheck after unclean shutdown
xfs: fix rounding error of fiemap length parameter
xfs: introduce xfs_bulkstat_ag_ichunk
xfs: require 64-bit sector_t
xfs: fix uflags detection at xfs_fs_rm_xquota
xfs: remove XFS_IS_OQUOTA_ON macros
xfs: tidy up xfs_set_inode32
xfs: allow inode allocations in post-growfs disk space
xfs: mark xfs_qm_quotacheck as static
...
Pull quota, reiserfs, UDF updates from Jan Kara:
"Scalability improvements for quota, a few reiserfs fixes, and couple
of misc cleanups (udf, ext2)"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Fix use after free in journal teardown
reiserfs: fix corruption introduced by balance_leaf refactor
udf: avoid redundant memcpy when writing data in ICB
fs/udf: re-use hex_asc_upper_{hi,lo} macros
fs/quota: kernel-doc warning fixes
udf: use linux/uaccess.h
fs/ext2/super.c: Drop memory allocation cast
quota: remove dqptr_sem
quota: simplify remove_inode_dquot_ref()
quota: avoid unnecessary dqget()/dqput() calls
quota: protect Q_GETFMT by dqonoff_mutex
Pull Ceph updates from Sage Weil:
"There is a lot of refactoring and hardening of the libceph and rbd
code here from Ilya that fix various smaller bugs, and a few more
important fixes with clone overlap. The main fix is a critical change
to the request_fn handling to not sleep that was exposed by the recent
mutex changes (which will also go to the 3.16 stable series).
Yan Zheng has several fixes in here for CephFS fixing ACL handling,
time stamps, and request resends when the MDS restarts.
Finally, there are a few cleanups from Himangi Saraogi based on
Coccinelle"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
rbd: remove extra newlines from rbd_warn() messages
rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
rbd: rework rbd_request_fn()
ceph: fix kick_requests()
ceph: fix append mode write
ceph: fix sizeof(struct tYpO *) typo
ceph: remove redundant memset(0)
rbd: take snap_id into account when reading in parent info
rbd: do not read in parent info before snap context
rbd: update mapping size only on refresh
rbd: harden rbd_dev_refresh() and callers a bit
rbd: split rbd_dev_spec_update() into two functions
rbd: remove unnecessary asserts in rbd_dev_image_probe()
rbd: introduce rbd_dev_header_info()
rbd: show the entire chain of parent images
ceph: replace comma with a semicolon
rbd: use rbd_segment_name_free() instead of kfree()
ceph: check zero length in ceph_sync_read()
ceph: reset r_resend_mds after receiving -ESTALE
...
fixes are:
* UBI deleted list items while iterating the list with 'list_for_each_entry'
* The UBI block driver did not work properly with very large UBI volumes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT6IALAAoJECmIfjd9wqK0yRwP/R4XpNyWZOrK72fvb3M6ucK0
SK3pu6FNfvOSYibmhOtZWjFbiMkgUd7o8jy+4uJa6PIMRQbKGrH/xvujPzON6GQT
ZZxeJ7n1O8IAZL/dibqFpEv3NQFjgz1IaLnpktQkkp6rWsXCBYNbvFIywWYgkndt
1w11QG/TAhkJcA/0Xy/+zmkHeXjOQrVeVMbBN2MMVekWg8JgMNTd9mkWGa1A1oGI
nuq30nIkUP7LX9T/olFGhjFIeeb/bkmWgpzS4K9PI+hOGGCZvIWj/pu8Jj8DAR54
UGA7qtMkTjxldvHPnuACsSKNS/N7UGY64kTwcucEcJaN2MXmaBFP8ipI/wtb9bCV
fUoBt+p470E7iDoQymbM5zNw/HPNh4XWVd2X1oYVftmO70PZ6sWeOKWC+tJn/Aj8
xsrgd1PcWmuyu399EFW/Il5ItOqfYsNIkVFNzIb1O6Pd8Ylt5eYtuyoPXk5aRA4p
xZ3ohO3OihvWXwwtCUjforwPy37iaX8vwnuI9xdgnEisTHJWR6PYITvcrwqAIH44
6PEqINU4zwGTGxOTOWKZxdtPVIuChuwqDyY+7+xQD+LIPQ1/BodozExGiQ0lhAmA
DDC2Te8q5lpZTz03E8ot9zZHmjQoG+uo10DhEzT5U1ycv9p7dISGfbVucyuKDTw9
JdUnI8cZ2k6mZi8q6dfA
=zgqa
-----END PGP SIGNATURE-----
Merge tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS changes from Artem Bityutskiy:
"No significant changes, mostly small fixes here and there. The more
important fixes are:
- UBI deleted list items while iterating the list with
'list_for_each_entry'
- The UBI block driver did not work properly with very large UBI
volumes"
* tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
UBIFS: Add log overlap assertions
Revert "UBIFS: add a log overlap assertion"
UBI: bugfix in ubi_wl_flush()
UBI: block: Avoid disk size integer overflow
UBI: block: Set disk_capacity out of the mutex
UBI: block: Make ubiblock_resize return something
UBIFS: add a log overlap assertion
UBIFS: remove unnecessary check
UBIFS: remove mst_mutex
UBIFS: kernel-doc warning fix
UBI: init_volumes: Ignore volumes with no LEBs
UBIFS: replace seq_printf by seq_puts
UBIFS: replace count*size kzalloc by kcalloc
UBIFS: kernel-doc warning fix
UBIFS: fix error path in create_default_filesystem()
UBIFS: fix spelling of "scanned"
UBIFS: fix some comments
UBIFS: remove useless @ecc in struct ubifs_scan_leb
UBIFS: remove useless statements
UBIFS: Add missing break statements in dbg_chk_pnode()
...
Many Linux filesystes make a file "sparse" when extending
a file with ftruncate. This does work for CIFS to Samba
(only) but not for SMB2/SMB3 (to Samba or Windows) since
there is a "set sparse" fsctl which is supposed to be
sent to mark a file as sparse.
This patch marks a file as sparse by sending this simple
set sparse fsctl if it is extended more than 2 pages.
It has been tested to Windows 8.1, Samba and various
SMB2/SMB3 servers which do support setting sparse (and
MacOS which does not appear to support the fsctl yet).
If a server share does not support setting a file
as sparse, then we do not retry setting sparse on that
share.
The disk space savings for sparse files can be quite
large (even more significant on Windows servers than Samba).
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
If do_journal_release() races with do_journal_end() which requeues
delayed works for transaction flushing, we can leave work items for
flushing outstanding transactions queued while freeing them. That
results in use after free and possible crash in run_timers_softirq().
Fix the problem by not requeueing works if superblock is being shut down
(MS_ACTIVE not set) and using cancel_delayed_work_sync() in
do_journal_release().
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Pull vfs updates from Al Viro:
"Stuff in here:
- acct.c fixes and general rework of mnt_pin mechanism. That allows
to go for delayed-mntput stuff, which will permit mntput() on deep
stack without worrying about stack overflows - fs shutdown will
happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
series without introducing tons of stack overflows on new mntput()
call chains it introduces.
- Bruce's d_splice_alias() patches
- more Miklos' rename() stuff.
- a couple of regression fixes (stable fodder, in the end of branch)
and a fix for API idiocy in iov_iter.c.
There definitely will be another pile, maybe even two. I'd like to
get Eric's series in this time, but even if we miss it, it'll go right
in the beginning of for-next in the next cycle - the tricky part of
prereqs is in this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
fix copy_tree() regression
__generic_file_write_iter(): fix handling of sync error after DIO
switch iov_iter_get_pages() to passing maximal number of pages
fs: mark __d_obtain_alias static
dcache: d_splice_alias should detect loops
exportfs: update Exporting documentation
dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
dcache: remove unused d_find_alias parameter
dcache: d_obtain_alias callers don't all want DISCONNECTED
dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
dcache: d_splice_alias mustn't create directory aliases
dcache: close d_move race in d_splice_alias
dcache: move d_splice_alias
namei: trivial fix to vfs_rename_dir comment
VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
cifs: support RENAME_NOREPLACE
hostfs: support rename flags
shmem: support RENAME_EXCHANGE
shmem: support RENAME_NOREPLACE
btrfs: add RENAME_NOREPLACE
...
All callers of locks_copy_lock pass in a brand new file_lock struct, so
there's no need to call locks_release_private on it. Replace that with
a warning that fires in the event that we receive a target lock that
doesn't look like it's properly initialized.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Now that they are a distinct lease type, show them as such.
Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
by C), copy_tree() of A would make a copy of D' shadow the the copy of
C, not the other way around.
It's easy to fix, fortunately - just make sure that mount follows
the one that shadows it in mnt_child as well as in mnt_hash, and when
copy_tree() decides to attach a new mount, check if the last child
it has added to the same parent should be shadowing the new one.
And if it should, just use the same logics commit_tree() has - put the
new mount into the hash and children lists right after the one that
should shadow it.
Cc: stable@vger.kernel.org [3.14 and later]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 743162013d ("sched: Remove proliferation of wait_on_bit() action
functions") has removed the call to cifs_oplock_break_wait, making this
function unused; remove it.
This fixes the following compilation warning:
fs/cifs/misc.c:578:1: warning: ‘cifs_oplock_break_wait’ defined but not used [-Wunused-function]
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
This reverts commits 344470cac4 and e813244072.
It turns out that the exact path in the symlink matters, if for somewhat
unfortunate reasons: some apparmor configurations don't allow dhclient
access to the per-thread /proc files. As reported by Jörg Otte:
audit: type=1400 audit(1407684227.003:28): apparmor="DENIED"
operation="open" profile="/sbin/dhclient"
name="/proc/1540/task/1540/net/dev" pid=1540 comm="dhclient"
requested_mask="r" denied_mask="r" fsuid=0 ouid=0
so we had better revert this for now. We might be able to work around
this in practice by only using the per-thread symlinks if the thread
isn't the thread group leader, and if the namespaces differ between
threads (which basically never happens).
We'll see. In the meantime, the revert was made to be intentionally easy.
Reported-by: Jörg Otte <jrg.otte@gmail.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace updates from Eric Biederman:
"This is a bunch of small changes built against 3.16-rc6. The most
significant change for users is the first patch which makes setns
drmatically faster by removing unneded rcu handling.
The next chunk of changes are so that "mount -o remount,.." will not
allow the user namespace root to drop flags on a mount set by the
system wide root. Aks this forces read-only mounts to stay read-only,
no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
mounts to stay no exec and it prevents unprivileged users from messing
with a mounts atime settings. I have included my test case as the
last patch in this series so people performing backports can verify
this change works correctly.
The next change fixes a bug in NFS that was discovered while auditing
nsproxy users for the first optimization. Today you can oops the
kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
with pid namespaces. I rebased and fixed the build of the
!CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
that no one to my knowledge bases anything on my tree fixing the typo
in place seems more responsible that requiring a typo-fix to be
backported as well.
The last change is a small semantic cleanup introducing
/proc/thread-self and pointing /proc/mounts and /proc/net at it. This
prevents several kinds of problemantic corner cases. It is a
user-visible change so it has a minute chance of causing regressions
so the change to /proc/mounts and /proc/net are individual one line
commits that can be trivially reverted. Unfortunately I lost and
could not find the email of the original reporter so he is not
credited. From at least one perspective this change to /proc/net is a
refgression fix to allow pthread /proc/net uses that were broken by
the introduction of the network namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
proc: Implement /proc/thread-self to point at the directory of the current thread
proc: Have net show up under /proc/<tgid>/task/<tid>
NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
namespaces: Use task_lock and not rcu to protect nsproxy
Pull nfsd updates from Bruce Fields:
"This includes a major rewrite of the NFSv4 state code, which has
always depended on a single mutex. As an example, open creates are no
longer serialized, fixing a performance regression on NFSv3->NFSv4
upgrades. Thanks to Jeff, Trond, and Benny, and to Christoph for
review.
Also some RDMA fixes from Chuck Lever and Steve Wise, and
miscellaneous fixes from Kinglong Mee and others"
* 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
svcrdma: remove rdma_create_qp() failure recovery logic
nfsd: add some comments to the nfsd4 object definitions
nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
nfsd: remove nfs4_lock_state: nfs4_laundromat
nfsd: Remove nfs4_lock_state(): reclaim_complete()
nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
nfsd: remove old fault injection infrastructure
nfsd: add more granular locking to *_delegations fault injectors
nfsd: add more granular locking to forget_openowners fault injector
nfsd: add more granular locking to forget_locks fault injector
nfsd: add a list_head arg to nfsd_foreach_client_lock
...
Pull CIFS updates from Steve French:
"The most visible change in this set is the additional of multi-credit
support for SMB2/SMB3 which dramatically improves the large file i/o
performance for these dialects and significantly increases the maximum
i/o size used on the wire for SMB2/SMB3.
Also reconnection behavior after network failure is improved"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (35 commits)
Add worker function to set allocation size
[CIFS] Fix incorrect hex vs. decimal in some debug print statements
update CIFS TODO list
Add Pavel to contributor list in cifs AUTHORS file
Update cifs version
CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
CIFS: Optimize readpages in a short read case on reconnects
CIFS: Optimize cifs_user_read() in a short read case on reconnects
CIFS: Improve indentation in cifs_user_read()
CIFS: Fix possible buffer corruption in cifs_user_read()
CIFS: Count got bytes in read_into_pages()
CIFS: Use separate var for the number of bytes got in async read
CIFS: Indicate reconnect with ECONNABORTED error code
CIFS: Use multicredits for SMB 2.1/3 reads
CIFS: Fix rsize usage for sync read
CIFS: Fix rsize usage in user read
CIFS: Separate page reading from user read
CIFS: Fix rsize usage in readpages
CIFS: Separate page search from readpages
CIFS: Use multicredits for SMB 2.1/3 writes
...
AMD-compatible CFI driver:
- Support OTP programming for Micron M29EW family
- Increase buffer write timeout, according to detected flash parameter info
NAND
- Add helpers for retrieving ONFI timing modes
- GPMI: provide option to disable bad block marker swapping (required for
Ka-On electronics platforms)
SPI NOR
- EON EN25QH128 support
- Support new Flag Status Register (FSR) on a few Micron flash
Common
- New sysfs entries for bad block and ECC stats
And a few miscellaneous refactorings, cleanups, and driver improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT5WCXAAoJEFySrpd9RFgt0rwP/1anAulAcve/QzVF9LDFPec8
jSvK8WWFcHLVb9EvTHtUjRz2RSRNhe0eeyEld3WpdKZZ73VVVaHnGdJv8J8Ys8jn
kfNBDfgDLFrVzycYCqjQ2gdvidCyrjQgtPP0E/Q/RN6FBur0/rp2WKoJ2FvuT6SS
kOz5f3TOe+iNtxQoJwkFvs/IjfFMThGs+YMJ8Z9s4LcJHD65T6hF+zDwl8xF2xfG
b104PsG3I58kJdYjKhRQ2/ol+YCPoVhQorhhuaqeouZum/Hb/2g3rKHVZpAv2n6m
JWnTpbdJDqGoPFVPyJr5Vm/UYOwxEBSWimuNp+2WN7EsXux1x9JZZl5+ZNUMmb4q
vxYhIDul2+Sg1lN+ruBe+xi6d8DI8Y5WIc9xJgn3YHLC8YSkiZ11bhQyyeHA9i5h
jZYKSkN/ERl8iAA4ULD6tsZv4ds8LVI/XOxrcSM7myQ4p8oY5QBxEWEuGPgyH6A6
qCVkc0TAriSPfcCBvs4o8s2uNUocq7x6ZT1xfdlJ0KVstCmeEJBJnBYwWIXR3tu7
3+I/bI41Q29ZGV8x5PEGDSgLVDNAJZnfGPuCwdMWuVD7UQuEEOJkCvy8o7C2Fold
hRXh3SlUICCGDzd0JdNmdt5hPuB0tzsG0YkRoRj2sS30TlHi77nN/m3HYi/JQ4UW
gA21laizfJ+z7g/5Kabb
=Wkmz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
"AMD-compatible CFI driver:
- Support OTP programming for Micron M29EW family
- Increase buffer write timeout, according to detected flash
parameter info
NAND
- Add helpers for retrieving ONFI timing modes
- GPMI: provide option to disable bad block marker swapping (required
for Ka-On electronics platforms)
SPI NOR
- EON EN25QH128 support
- Support new Flag Status Register (FSR) on a few Micron flash
Common
- New sysfs entries for bad block and ECC stats
And a few miscellaneous refactorings, cleanups, and driver
improvements"
* tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd: (31 commits)
mtd: gpmi: make blockmark swapping optional
mtd: gpmi: remove line breaks from error messages and improve wording
mtd: gpmi: remove useless (void *) type casts and spaces between type casts and variables
mtd: atmel_nand: NFC: support multiple interrupt handling
mtd: atmel_nand: implement the nfc_device_ready() by checking the R/B bit
mtd: atmel_nand: add NFC status error check
mtd: atmel_nand: make ecc parameters same as definition
mtd: nand: add ONFI timing mode to nand_timings converter
mtd: nand: define struct nand_timings
mtd: cfi_cmdset_0002: fix do_write_buffer() timeout error
mtd: denali: use 8 bytes for READID command
mtd/ftl: fix the double free of the buffers allocated in build_maps()
mtd: phram: Fix whitespace issues
mtd: spi-nor: add support for EON EN25QH128
mtd: cfi_cmdset_0002: Add support for locking OTP memory
mtd: cfi_cmdset_0002: Add support for writing OTP memory
mtd: cfi_cmdset_0002: Invalidate cache after entering/exiting OTP memory
mtd: cfi_cmdset_0002: Add support for reading OTP
mtd: spi-nor: add support for flag status register on Micron chips
mtd: Account for BBT blocks when a partition is being allocated
...
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
- one side cannot overwrite data while the other reads it
- one side cannot shrink the buffer while the other accesses it
- one side cannot grow the buffer beyond previously set boundaries
If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
1) A graphics client wants to share its rendering-buffer with a
graphics-server. The memory-region is allocated by the client for
read/write access and a second FD is passed to the server. While
scanning out from the memory region, the server has no guarantee that
the client doesn't shrink the buffer at any time, requiring rather
cumbersome SIGBUS handling.
2) A process wants to perform an RPC on another process. To avoid huge
bandwidth consumption, zero-copy is preferred. After a message is
assembled in-memory and a FD is passed to the remote side, both sides
want to be sure that neither modifies this shared copy, anymore. The
source may have put sensible data into the message without a separate
copy and the target may want to parse the message inline, to avoid a
local copy.
While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.
This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever. Unlike locks,
seals can only be set, never removed. Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.
An initial set of SEALS is introduced by this patch:
- SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
in size. This affects ftruncate() and open(O_TRUNC).
- GROW: If SEAL_GROW is set, the file in question cannot be increased
in size. This affects ftruncate(), fallocate() and write().
- WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
are possible. This affects fallocate(PUNCH_HOLE), mmap() and
write().
- SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
This basically prevents the F_ADD_SEAL operation on a file and
can be set to prevent others from adding further seals that you
don't want.
The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
1) The graphics server can verify that a passed file-descriptor has
SEAL_SHRINK set. This allows safe scanout, while the client is
allowed to increase buffer size for window-resizing on-the-fly.
Concurrent writes are explicitly allowed.
2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
process can modify the data while the other side parses it.
Furthermore, it guarantees that even with writable FDs passed to the
peer, it cannot increase the size to hit memory-limits of the source
process (in case the file-storage is accounted to the source).
The new API is an extension to fcntl(), adding two new commands:
F_GET_SEALS: Return a bitset describing the seals on the file. This
can be called on any FD if the underlying file supports
sealing.
F_ADD_SEALS: Change the seals of a given file. This requires WRITE
access to the file and F_SEAL_SEAL may not already be set.
Furthermore, the underlying file must support sealing and
there may not be any existing shared mapping of that file.
Otherwise, EBADF/EPERM is returned.
The given seals are _added_ to the existing set of seals
on the file. You cannot remove seals again.
The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (of 6):
The i_mmap_writable field counts existing writable mappings of an
address_space. To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.
This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set. In case there
exists a writable mapping, this operation will fail with EBUSY.
Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers. This is the same
that we do for i_writecount.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes checkpatch warning:
WARNING: debugfs_remove(NULL) is safe this check is probably not required
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now with 64bit bzImage and kexec tools, we support ramdisk that size is
bigger than 2g, as we could put it above 4G.
Found compressed initramfs image could not be decompressed properly. It
turns out that image length is int during decompress detection, and it
will become < 0 when length is more than 2G. Furthermore, during
decompressing len as int is used for inbuf count, that has problem too.
Change len to long, that should be ok as on 32 bit platform long is
32bits.
Tested with following compressed initramfs image as root with kexec.
gzip, bzip2, xz, lzma, lzop, lz4.
run time for populate_rootfs():
size name Nehalem-EX Westmere-EX Ivybridge-EX
9034400256 root_img : 26s 24s 30s
3561095057 root_img.lz4 : 28s 27s 27s
3459554629 root_img.lzo : 29s 29s 28s
3219399480 root_img.gz : 64s 62s 49s
2251594592 root_img.xz : 262s 260s 183s
2226366598 root_img.lzma: 386s 376s 277s
2901482513 root_img.bz2 : 635s 599s
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rashika Kheria <rashika.kheria@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kyungsik Lee <kyungsik.lee@lge.com>
Cc: P J P <ppandit@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Daniel M. Weeks" <dan@danweeks.net>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add DDEBUG in Makefile when CONFIG_QNX6FS_DEBUG is set. All QNX6DEBUG
messages are replaced by pr_debug which means debugging will be emitted in
debug level only and no more in error and info levels. debug uses now
pr_fmt and __func__
QNX6DEBUG definition has been removed.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove "qnx6:" and "qnx6: " from each logging instruction.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use current logging functions.
Coalesce formats.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>