When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.
Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.
Reported-by: Micah Parrish <micah.parrish@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
DAX imposes additional requirements to a device. Add
bdev_dax_supported() which performs all the precondition checks
necessary for filesystem to mount the device with dax option.
Also add a new check to verify if a partition is aligned by 4KB.
When a partition is unaligned, any dax read/write access fails,
except for metadata update.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
In preparation of moving DAX capability checks to the block layer
from filesystem code, add a VFS message interface that aligns with
filesystem's message format.
For instance, a vfs_msg() message followed by XFS messages in case
of a dax mount error may look like:
VFS (pmem0p1): error: unaligned partition for dax
XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
XFS (pmem0p1): Mounting V5 Filesystem
:
vfs_msg() is largely based on ext4_msg().
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Callers of dax fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check inode size when entering the
function and we don't have to recheck it again later in the handler.
Note that inode size itself can be decreased while the fault handler
runs but filesystem locking prevents against any radix tree or block
mapping information changes resulting from the truncate and that is what
we really care about.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
dax_do_io() is calling filemap_write_and_wait() if DIO_LOCKING flags is
set. Presumably this was copied over from direct IO code. However DAX
inodes have no pagecache pages to write so the call is pointless. Remove
it.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
All the filesystems are now zeroing blocks themselves for DAX IO to avoid
races between dax_io() and dax_fault(). Remove the zeroing code from
dax_io() and add warning to catch the case when somebody unexpectedly
returns new or unwritten buffer.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Now that all filesystems zero out blocks allocated for a fault handler,
we can just remove the zeroing from the handler itself. Also add checks
that no filesystem returns to us unwritten or new buffer.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Currently ext2 zeroes any data blocks allocated for DAX inode however it
still returns them as BH_New. Thus DAX code zeroes them again in
dax_insert_mapping() which can possibly overwrite the data that has been
already stored to those blocks by a racing dax_io(). Avoid marking
pre-zeroed buffers as new.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
When zeroing allocated blocks for DAX, we accidentally zeroed only the
first allocated block instead of all of them. So far this problem is
hidden by the fact that page faults always need only a single block and
DAX write code zeroes blocks again. But the zeroing in DAX code is racy
and needs to be removed so fix the zeroing in ext2 to zero all allocated
blocks.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:
ext4_ext_direct_IO() dax_fault()
dax_io()
get_block() - allocates unwritten extent
copy_from_iter_pmem()
get_block() - converts
unwritten block to
written and zeroes it
out
ext4_convert_unwritten_extents()
So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.
Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.
Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_dax_get_blocks() was accidentally omitted fixing get blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use
ext4_get_blocks_trans() helper which takes care of these errors.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, __dax_fault() does not call get_blocks() callback with create
argument set, when we got back unwritten extent from the initial
get_blocks() call during a write fault. This is because originally
filesystems were supposed to convert unwritten extents to written ones
using complete_unwritten() callback. Later this was abandoned in favor of
using pre-zeroed blocks however the condition whether get_blocks() needs
to be called with create == 1 remained.
Fix the condition so that filesystems are not forced to zero-out and
convert unwritten extents when get_blocks() is called with create == 0
(which introduces unnecessary overhead for read faults and can be
problematic as the filesystem may possibly be read-only).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_find_extent(), stripped down to the parts relevant to this patch,
reads as
ppos = 0;
i = depth;
while (i) {
--i;
++ppos;
if (unlikely(ppos > depth)) {
...
ret = -EFSCORRUPTED;
goto err;
}
}
Due to the loop's bounds, the condition ppos > depth can never be met.
Remove this dead code.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
which is why it had to get an extra reference to the bio before
submitting it. But since we no longer touch the bio after submission,
get rid of the redundant get/put of the bio. If we do get the extra
reference, we enter the slower path of having to flag this bio as now
having external references.
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, in ext4_mb_init(), there's a loop like the following:
do {
...
offset += 1 << (sb->s_blocksize_bits - i);
i++;
} while (i <= sb->s_blocksize_bits + 1);
Note that the updated offset is used in the loop's next iteration only.
However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[<ffffffff818c4d25>] dump_stack+0xbc/0x117
[<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
[<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
[<ffffffff814293c7>] ? create_cache+0x57/0x1f0
[<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
[<ffffffff821c2168>] ? mutex_lock+0x38/0x60
[<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
[<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
[<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
[<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
[...]
Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.
Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, in mb_find_order_for_block(), there's a loop like the following:
while (order <= e4b->bd_blkbits + 1) {
...
bb += 1 << (e4b->bd_blkbits - order);
}
Note that the updated bb is used in the loop's next iteration only.
However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[<ffffffff818c4d35>] dump_stack+0xbc/0x117
[<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
[<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
[...]
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.
Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.
Fixes: 9b7365fc1c ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.
This can be reproduced via:
mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt
(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)
This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)
[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf
Cc: stable@vger.kernel.org
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.
static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));
Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.
./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The function jbd2_journal_extend() takes as its argument the number of
new credits to be added to the handle. We weren't taking into account
the currently unused handle credits; worse, we would try to extend the
handle by N credits when it had N credits available.
In the case where jbd2_journal_extend() fails because the transaction
is too large, when jbd2_journal_restart() gets called, the N credits
owned by the handle gets returned to the transaction, and the
transaction commit is asynchronously requested, and then
start_this_handle() will be able to successfully attach the handle to
the current transaction since the required credits are now available.
This is mostly harmless, but since ext4_ext_truncate_extend_restart()
returns EAGAIN, the truncate machinery will once again try to call
ext4_ext_truncate_extend_restart(), which will do the above sequence
over and over again until the transaction has committed.
This was found while I was debugging a lockup in caused by running
xfstests generic/074 in the data=journal case. I'm still not sure why
we ended up looping forever, which suggests there may still be another
bug hiding in the transaction accounting machinery, but this commit
prevents us from looping in the first place.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.
Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a6 removed the logic with a
flawed argument that it is not needed.
The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.
The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().
CC: stable@vger.kernel.org
Fixes: f3b59291a6
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations. Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().
This was reverted earlier due to a typo in the original commit where I
experimented with using signal_pending() instead of
fatal_signal_pending(). The test was in the wrong place if we were
going to return signal_pending() since we would end up returning
duplicant entries. See 9f2394c9be for a more detailed explanation.
Added fix as suggested by Linus to check for signal_pending() in
in the filldir() functions.
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Here are 3 small fixes 4.6-rc4. Two fix up some lz4 issues with big
endian systems, and the remaining one resolves a minor debugfs issue
that was reported.
All have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlcS3ucACgkQMUfUDdst+ynw1QCbBGgbY7Xt08whNFcAP81z1Q5X
fmsAn1fyrhfsxe+JnybzswsTOjFw99Xd
=YX6h
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull misc fixes from Greg KH:
"Here are three small fixes for 4.6-rc4.
Two fix up some lz4 issues with big endian systems, and the remaining
one resolves a minor debugfs issue that was reported.
All have been in linux-next with no reported issues"
* tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
lib: lz4: cleanup unaligned access efficiency detection
lib: lz4: fixed zram with lz4 on big endian machines
debugfs: Make automount point inodes permanently empty
These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4 crypto,
which have not yet been fully propagated as follows.
- use of dget_parent and file_dentry to avoid crashes
- disallow RCU-mode lookup in d_invalidate
- disallow -ENOMEM in the core data encryption path
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXDbkhAAoJEEAUqH6CSFDS6yMP/1L+1PcmTJKloe0wCft57fL0
hXRXI57DeNwCxc9fKmaBnAn9uv96qc7fWq1oMqT0oYPTf8qGUaIc0bALxx/IFCwh
Ul1QOzlGJmomPq4kGleOEPmDGBGUwDYgH55rEadhmevbOMX5IGQOSDGu9qxZ+LMA
orYTYTHX5tjeR1UAxzjMkwa0WRyYJGRsqqycgIl8f/kIg06GIpxOx1IIuFQOBv1N
BJlkKbbiX/snWKXXVCNlFzVAZOWjpgzfvCGB0Fge4HZlMZL876jKIj+4dVmQ5xiI
K6i5NXk1U13hhQEFzK4McicHBwlMnMxeMbdR8wm1vQ/4UfQFZoQAeClAxz5/ZyPr
lHi+VamyTbMfRwijp49hNQ8SKjMKPd70yFd2d2csQAwVra2t4fmXqieh6AEppkJb
EDeFAK8Y+PGis6XVnAJ1wkXW7OSwzfJUVAmj3oFsZ7Onny+mm++f+i7a7I3RKUib
SH23Imv1OQaoGcAFVSCYe/JP9dcfKUwzSFFYOiQneqdNTNN8rzoPYYhQ4hNCKv6/
zHSEI2EcpGipmSTh4dROPJvNf42QG8Sw4lEuyGsXy251t7wxBrc7e43BzXtjyKsl
/2d5hgAQbR9HeU3Oz41Dq/LEjReEAcdJDKMETP9/qrzs7qDJgtI0XoUkVSn3NyOR
i3JSEix2JljR97DzUIW/
=ML1+
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs/fscrypto fixes from Jaegeuk Kim:
"In addition to f2fs/fscrypto fixes, I've added one patch which
prevents RCU mode lookup in d_revalidate, as Al mentioned.
These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4
crypto, which have not yet been fully propagated as follows.
- use of dget_parent and file_dentry to avoid crashes
- disallow RCU-mode lookup in d_invalidate
- disallow -ENOMEM in the core data encryption path"
* tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
ext4/fscrypto: avoid RCU lookup in d_revalidate
fscrypto: don't let data integrity writebacks fail with ENOMEM
f2fs: use dget_parent and file_dentry in f2fs_file_open
fscrypto: use dget_parent() in fscrypt_d_revalidate()
A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.
The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.
So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones. Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.
It turns out that seq_file _does_ save away the user_ns information of
the file, though. Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Al pointed, d_revalidate should return RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d6662 ("fs: rcu-walk aware d_revalidate method").
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable <stable@vger.kernel.org>
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.
In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we're already do this for the
tracefs mountpoint in sysfs.
Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes the issue introduced by the ext4 crypto fix in a same manner.
For F2FS, however, we flush the pending IOs and wait for a while to acquire free
memory.
Fixes: c9af28fdd4 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch synced with the below two ext4 crypto fixes together.
In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes
overlayfs. To fix, now we need to use file_dentry() to access that field.
Fixes: c0a37d4878 ("ext4: use file_dentry()")
Fixes: 9dd78d8c9a ("ext4: use dget_parent() in ext4_file_open()")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch updates fscrypto along with the below ext4 crypto change.
Fixes: 3d43bcfef5 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 1028b55baf.
It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.
You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterorator where we know the position of the actual failing entry.
I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.
So my track record is not good enough, and the stars will have to align
better before that one gets committed. And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.
IOW, let's just revert the commit that caused the problem for now.
Reported-by: Greg Thelen <gthelen@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"These are bug fixes, including a really old fsync bug, and a few trace
points to help us track down problems in the quota code"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix file/data loss caused by fsync after rename and new inode
btrfs: Reset IO error counters before start of device replacing
btrfs: Add qgroup tracing
Btrfs: don't use src fd for printk
btrfs: fallback to vmalloc in btrfs_compare_tree
btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
btrfs: Output more info for enospc_debug mount option
Btrfs: fix invalid reference in replace_path
Btrfs: Improve FL_KEEP_SIZE handling in fallocate
Cleanups:
- remove an unused variable from orangefs_readdir.
- clean up printk wrapper used for ofs "gossip" debugging.
- clean up truncate ctime and mtime setting in inode.c
- remove a useless null check found by coccinelle.
- optimize some memcpy/memset boilerplate code.
- remove some useless sanity checks from xattr.c
Fix:
- fix a potential strncpy vulnerability.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJXCAvtAAoJEM9EDqnrzg2+U3kP/RcypLnTdnOLFI7GtTr7erpc
4ys4UE3CdvOdIkDNgTeW+qTvy/b7qo3JBdaRqMAAmnWkxWn4cGXDGLhmfKD/6a3Q
LL9vKxiczYN8gAYgnxoF5rNwoVScFVD7PospKkWMPedkRu00K0dqiJvomPXsvwM7
MEl4ueewq3KMrXHu1GYSdoZhU8BWtku789Oa5leXRqEO2pRit0amX/qxB9KG6Urz
Ebz0yCGMPP61nTN9A/Q7IM2imcJO5F0wc5uaSaWimjHeHb040ytMVqFa4GOh4Pky
oQjjw5OYSsX/O7Hh66wKQ68YqYb762OKE1y5v8K43vRhWQtnQo9pdKpFVEK0An8Z
wOSaUFd5mhPD2KENcKB/kSiH3eOyGdyD2QamptLJ6Opl/6UdPQMt++1EkuSuYEdW
wnCFsJM5yam3Ot+REnQiYAVjLsDZ0XhPfNIuAp+d4LIV32JGTQPOBurKECwsJbj5
fK0lsBNk7b8qgBwG41JjV7XyGlf5HWT2TwpuC2S5PysVXcxsvdfR3oVeUnzdL6aB
fv0mp7bBqg2lNLKoeEYlxb+vbSiLutOCIPWSNptYjbCWa/q/bqnZMhXOUQjS5XRj
HIG/91XeRUxOZk+y1gs5N5F8pCT9mfWRTvSCX5ex5x17rUFto/jhn2mQLAhxSyPD
/AGQgNGz2VmmVBlaCngI
=ZRlK
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fixes from Mike Marshall:
"Orangefs cleanups and a strncpy vulnerability fix.
Cleanups:
- remove an unused variable from orangefs_readdir.
- clean up printk wrapper used for ofs "gossip" debugging.
- clean up truncate ctime and mtime setting in inode.c
- remove a useless null check found by coccinelle.
- optimize some memcpy/memset boilerplate code.
- remove some useless sanity checks from xattr.c
Fix:
- fix a potential strncpy vulnerability"
* tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: remove unused variable
orangefs: Add KERN_<LEVEL> to gossip_<level> macros
orangefs: strncpy -> strscpy
orangefs: clean up truncate ctime and mtime setting
Orangefs: fix ifnullfree.cocci warnings
Orangefs: optimize boilerplate code.
Orangefs: xattr.c cleanup
Emit the logging messages at the appropriate levels.
Miscellanea:
o Change format to fmt
o Use the more common ##__VA_ARGS__
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It would have been possible for a rogue client-core to send in a symlink
target which is not NUL terminated. This returns EIO if the client-core
gives us corrupt data.
Leave debugfs and superblock code as is for now.
Other dcache.c and namei.c strncpy instances are safe because
ORANGEFS_NAME_MAX = NAME_MAX + 1; there is always enough space for a
name plus a NUL byte.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The ctime and mtime are always updated on a successful ftruncate and
only updated on a successful truncate where the size changed.
We handle the ``if the size changed'' bit.
This matches FUSE's behavior.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
fs/orangefs/orangefs-debugfs.c:130:2-26: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Suggested by David Binderman <dcb314@hotmail.com>
The former can potentially be a performance win over the latter.
memcpy(d, s, len);
memset(d+len, c, size-len);
memset(d, c, size);
memcpy(d, s, len);
Signed-off-by: Mike Marshall <hubcap@omnibond.com>