If ->writepage() tries to write back a page whose copy is still in flight,
then just skip by calling redirty_page_for_writepage().
This is OK, since now ->writepage() should never be called for data
integrity sync.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
As Maxim Patlasov pointed out, it's possible to get a dirty page while it's
copy is still under writeback, despite fuse_page_mkwrite() doing its thing
(direct IO).
This could result in two concurrent write request for the same offset, with
data corruption if they get mixed up.
To prevent this, fuse needs to check and delay such writes. This
implementation does this by:
1. check if page is still under writeout, if so create a new, single page
secondary request for it
2. chain this secondary request onto the in-flight request
2/a. if a seconday request for the same offset was already chained to the
in-flight request, then just copy the contents of the page and discard
the new secondary request. This makes sure that for each page will
have at most two requests associated with it
3. when the in-flight request finished, send off all secondary requests
chained onto it
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Checking against tmp-page indexes is not very useful, and results in one
(or rarely two) page requests. Which is not much of an improvement...
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request,
then releases the original page by end_page_writeback and unlock it.
4) fuse_do_setattr completes and successfully returns. Since now, i_mutex
is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
fuse_send_write_pages do wait for fuse writeback, but for another
page->index.
6) fuse_writepages_fill attaches more pages to the request (if any), then
fuse_writepages_send is eventually called. It is supposed to crop
inarg->size of the request, but it doesn't because i_size has already been
extended back.
Moving end_page_writeback behind fuse_writepages_send guarantees that
__fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size
of the request before write(2) gets the chance to extend i_size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .writepages one is required to make each writeback request carry more than
one page on it. The patch enables optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.
[SzM: simplify, add comments]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Don't bug if there's no writable files found for page writeback. If ever
this is triggered, a WARN_ON helps debugging it much better then a BUG_ON.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Lock the page in fuse_page_mkwrite() to protect against a race with
fuse_writepage() where the page is redirtied before the actual writeback
begins.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed
description of races between ftruncate and anyone who can extend i_size:
> 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
> changed (due to our own fuse_do_setattr()) and is going to call
> truncate_pagecache() for some 'new_size' it believes valid right now. But by
> the time that particular truncate_pagecache() is called ...
> 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
> not -- it doesn't matter).
> 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
> 4. mmap-ed write makes a page in the extended region dirty.
This patch adds necessary bits to fuse_file_fallocate() to protect from that
race.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
the page goes to write-back. The page is fully processed by fuse_writepage
(including end_page_writeback on the page), but fuse_flush_writepages did
nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
by calling fuse_release_nowrite. The latter triggers processing queued
write-back request which will write stale data to the hole soon.
Changed in v2 (thanks to Brian for suggestion):
- Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
we can end up in returning -ENOTSUPP while user data is already punched
from page cache. Use filemap_write_and_wait_range() instead.
Changed in v3 (thanks to Miklos for suggestion):
- fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
instead. So far as we need a dirty-page barrier only, fuse_sync_writes()
should be enough.
- rebased to for-linus branch of fuse.git
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The way how fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Because, w/o i_mutex held, we never sure whether
'oldsize' and 'attr->size' are valid by the time of execution of
truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
released fc->lock in the middle of fuse_change_attributes(), we completely
loose control of actions which may happen with given inode until we reach
truncate_pagecache. The list of potentially dangerous actions includes
mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
The typical outcome of doing truncate_pagecache() with outdated arguments
is data corruption from user point of view. This is (in some sense)
acceptable in cases when the issue is triggered by a change of the file on
the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real life case I discovered by fsx-linux
looked like this:
1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size, releases fc->lock, but before comparing oldsize vs
attr->size..
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.
The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of inode changed significantly.
Theoretically, the following is possible:
1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.
The result will be the lost of data user wrote on the fourth step.
The patch is a hotfix resolving the issue in a simplistic way: let's skip
dangerous i_size update and truncate_pagecache if an operation changing
file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. And to handle them properly, more sophisticated
and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
postpone it until the issue is well discussed on the mailing list(s).
Changed in v2:
- improved patch description to cover both sides of the issue.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
the original page by end_page_writeback.
4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
fuse_send_write_pages do wait for fuse writeback, but for another
page->index.
6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
fuse_send_writepage is supposed to crop inarg->size of the request,
but it doesn't because i_size has already been extended back.
Moving end_page_writeback to the end of fuse_writepage_locked fixes the
race because now the fact that truncate_pagecache is successfully returned
infers that fuse_writepage_locked has already called end_page_writeback.
And this, in turn, infers that fuse_flush_writepages has already called
fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
could not extend it because of i_mutex held by ftruncate(2).
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
Changing size of a file on server and local update (fuse_write_update_size)
should be always protected by inode->i_mutex. Otherwise a race like this is
possible:
1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE).
fuse_file_fallocate() sends FUSE_FALLOCATE request to the server.
2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr()
sends shrinking FUSE_SETATTR request to the server and updates local i_size
by i_size_write(inode, outarg.attr.size).
3. Process 'A' resumes execution of fuse_file_fallocate() and calls
fuse_write_update_size(inode, offset + length). But 'offset + length' was
obsoleted by ftruncate from previous step.
Changed in v2 (thanks Brian and Anand for suggestions):
- made relation between mutex_lock() and fuse_set_nowrite(inode) more
explicit and clear.
- updated patch description to use ftruncate(2) in example
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The bug was introduced with async_dio feature: trying to optimize short reads,
we cut number-of-bytes-to-read to i_size boundary. Hence the following example:
truncate --size=300 /mnt/file
dd if=/mnt/file of=/dev/null iflag=direct
led to FUSE_READ request of 300 bytes size. This turned out to be problem
for userspace fuse implementations who rely on assumption that kernel fuse
does not change alignment of request from client FS.
The patch turns off the optimization if async_dio is disabled. And, if it's
enabled, the patch fixes adjustment of number-of-bytes-to-read to preserve
alignment.
Note, that we cannot throw out short read optimization entirely because
otherwise a direct read of a huge size issued on a tiny file would generate
a huge amount of fuse requests and most of them would be ACKed by userspace
with zero bytes read.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If request submission fails for an async request (i.e.,
get_user_pages() returns -ERESTARTSYS), we currently skip the
-EIOCBQUEUED return and drop into wait_for_sync_kiocb() forever.
Avoid this by always returning -EIOCBQUEUED for async requests. If
an error occurs, the error is passed into fuse_aio_complete(),
returned via aio_complete() and thus propagated to userspace via
io_getevents().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the
size of a file. Update the inode size after a successful fallocate.
Also invalidate the inode attributes after a successful fallocate
to ensure we pick up the latest attribute values (i.e., i_blocks).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE
interface. When a hole punch is passed through, the page cache
is not cleared and thus allows reading stale data from the cache.
This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a
smallish random data file into cache, punching a hole and creating
a copy of the file. Drop caches or remount and observe that the
original file no longer matches the file copied after the hole
punch. The original file contains a zeroed range and the latter
file contains stale data.
Protect against writepage requests in progress and punch out the
associated page cache range after a successful client fs hole
punch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit 8b41e671 introduced explicit background checking for fuse_req
structures with BUG_ON() checks for the appropriate type of request in
in the associated send functions. Commit bcba24cc introduced the ability
to send dio requests as background requests but does not update the
request allocation based on the type of I/O request. As a result, a
BUG_ON() triggers in the fuse_request_send_background() background path if
an async I/O is sent.
Allocate a request based on the async state of the fuse_io_priv to avoid
the BUG.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
Pull fuse updates from Miklos Szeredi:
"This contains two patchsets from Maxim Patlasov.
The first reworks the request throttling so that only async requests
are throttled. Wakeup of waiting async requests is also optimized.
The second series adds support for async processing of direct IO which
optimizes direct IO and enables the use of the AIO userspace
interface."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add flag to turn on async direct IO
fuse: truncate file if async dio failed
fuse: optimize short direct reads
fuse: enable asynchronous processing direct IO
fuse: make fuse_direct_io() aware about AIO
fuse: add support of async IO
fuse: move fuse_release_user_pages() up
fuse: optimize wake_up
fuse: implement exclusive wakeup for blocked_waitq
fuse: skip blocking on allocations of synchronous requests
fuse: add flag fc->initialized
fuse: make request allocations for background processing explicit
Without async DIO write requests to a single file were always serialized.
With async DIO that's no longer the case.
So don't turn on async DIO by default for fear of breaking backward
compatibility.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch improves error handling in fuse_direct_IO(): if we successfully
submitted several fuse requests on behalf of synchronous direct write
extending file and some of them failed, let's try to do our best to clean-up.
Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If user requested direct read beyond EOF, we can skip sending fuse requests
for positions beyond EOF because userspace would ACK them with zero bytes read
anyway. We can trust to i_size in fuse_direct_IO for such cases because it's
called from fuse_file_aio_read() and the latter updates fuse attributes
including i_size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In case of synchronous DIO request (i.e. read(2) or write(2) for a file
opened with O_DIRECT), the patch submits fuse requests asynchronously, but
waits for their completions before return from fuse_direct_IO().
In case of asynchronous DIO request (i.e. libaio io_submit() or a file opened
with O_DIRECT), the patch submits fuse requests asynchronously and return
-EIOCBQUEUED immediately.
The only special case is async DIO extending file. Here the patch falls back
to old behaviour because we can't return -EIOCBQUEUED and update i_size later,
without i_mutex hold. And we have no method to wait on real async I/O
requests.
The patch also clean __fuse_direct_write() up: it's better to update i_size
in its callers. Thanks Brian for suggestion.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch implements passing "struct fuse_io_priv *io" down the stack up to
fuse_send_read/write where it is used to submit request asynchronously.
io->async==0 designates synchronous processing.
Non-trivial part of the patch is changes in fuse_direct_io(): resources
like fuse requests and user pages cannot be released immediately in async
case.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch implements a framework to process an IO request asynchronously. The
idea is to associate several fuse requests with a single kiocb by means of
fuse_io_priv structure. The structure plays the same role for FUSE as 'struct
dio' for direct-io.c.
The framework is supposed to be used like this:
- someone (who wants to process an IO asynchronously) allocates fuse_io_priv
and initializes it setting 'async' field to non-zero value.
- as soon as fuse request is filled, it can be submitted (in non-blocking way)
by fuse_async_req_send()
- when all submitted requests are ACKed by userspace, io->reqs drops to zero
triggering aio_complete()
In case of IO initiated by libaio, aio_complete() will finish processing the
same way as in case of dio_complete() calling aio_complete(). But the
framework may be also used for internal FUSE use when initial IO request
was synchronous (from user perspective), but it's beneficial to process it
asynchronously. Then the caller should wait on kiocb explicitly and
aio_complete() will wake the caller up.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_release_user_pages() will be indirectly used by fuse_send_read/write
in future patches.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
There are two types of processing requests in FUSE: synchronous (via
fuse_request_send()) and asynchronous (via adding to fc->bg_queue).
Fortunately, the type of processing is always known in advance, at the time
of request allocation. This preparatory patch utilizes this fact making
fuse_get_req() aware about the type. Next patches will use it.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
commit 626cf23660 "poll: add poll_requested_events()..." enabled us to send the
requested events to the filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming
that 'task' is some other process but in fact this parameter always
equals to 'current'. Inline this parameter to make it more readable
and remove RCU lock as it is not needed when access current process
credentials.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
__fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The
patch calculates 'n' based on iov[] array. This is useful because allocating
FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request
would be waste of memory in case of iov-s of smaller size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as
possible. This is very beneficial in case of iov[] consisting of many
iov-s of relatively small sizes (e.g. PAGE_SIZE).
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch makes preliminary work for the next patch optimizing scatter-gather
direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s
to each fuse request as possible. So, here we only rework all related
call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages().
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.
Having req->page_descs[] filled properly allows to re-work fuse_copy_pages()
to copy page fragments described by req->page_descs[]. This will be useful
for next patches optimizing direct_IO.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.
Per-request page_offset is removed because anybody who need it can use
req->page_descs[0].offset instead.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_do_ioctl() already calculates the number of pages it's going to use. It is
stored in 'num_pages' variable. So the patch simply uses it for allocating
fuse_req.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch allocates as many page pointers in fuse_req as needed to cover
interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced
to hide this cumbersome arithmetic.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch uses 'nr_pages' argument of fuse_readpages() as heuristics for the
number of page pointers to allocate.
This can be improved further by taking in consideration fc->max_read and gaps
between page indices, but it's not clear whether it's worthy or not.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch categorizes all fuse_get_req() invocations into two categories:
- fuse_get_req_nopages(fc) - when caller doesn't care about req->pages
- fuse_get_req(fc, n) - when caller need n page pointers (n > 0)
Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0)
scattered over code. Now it's clear from the first glance when a caller need
fuse_req with page pointers.
The patch doesn't make any logic changes. In multi-page case, it silly
allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended
by future patches.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from
fuse_req. Instead of that, req->pages may now point either to small inline
array or to an array allocated dynamically.
This essentially means that all callers of fuse_request_alloc[_nofs] should
pass the number of pages needed explicitly.
The patch doesn't make any logic changes.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix the following sparse warning:
fs/fuse/file.c:2249:6: warning: symbol 'fuse_file_fallocate' was not declared. Should it be static?
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7572777eef attempted to verify that
the total iovec from the client doesn't overflow iov_length() but it
only checked the first element. The iovec could still overflow by
starting with a small element. The obvious fix is to check all the
elements.
The overflow case doesn't look dangerous to the kernel as the copy is
limited by the length after the overflow. This fix restores the
intention of returning an error instead of successfully copying less
than the iovec represented.
I found this by code inspection. I built it but don't have a test case.
I'm cc:ing stable because the initial commit did as well.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: <stable@vger.kernel.org> [2.6.37+]
Convert check in fuse_file_aio_write() to using new freeze protection.
CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A fuse-based network filesystem might allow for the inode
and/or file data to change unexpectedly. A local client
that opens and repeatedly reads a file might never pick
up on such changes and indefinitely return stale data.
Always invoke fuse_update_attributes() in the read path
to cause an attr revalidation when the attributes expire.
This leads to a page cache invalidation if necessary and
ensures fuse issues new read requests to the fuse client.
The original logic (reval only on reads beyond EOF) is
preserved unless the client specifies FUSE_AUTO_INVAL_DATA
on init.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.
I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
fallocate filesystem operation preallocates media space for the given file.
If fallocate returns success then any subsequent write to the given range
never fails with 'not enough space' error.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull fuse updates from Miklos Szeredi.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use flexible array in fuse.h
fuse: allow nanosecond granularity
fuse: O_DIRECT support for files
fuse: fix nlink after unlink
Implement ->direct_IO() method in aops. The ->direct_IO() method combines
the existing fuse_direct_read/fuse_direct_write methods to implement
O_DIRECT functionality.
Reaching ->direct_IO() in the read path via generic_file_aio_read ensures
proper synchronization with page cache with its existing framework.
Reaching ->direct_IO() in the write path via fuse_file_aio_write is made
to come via generic_file_direct_write() which makes it play nice with
the page cache w.r.t other mmap pages etc.
On files marked 'direct_io' by the filesystem server, IO always follows
the fuse_direct_read/write path. There is no effect of fcntl(O_DIRECT)
and it always succeeds.
On files not marked with 'direct_io' by the filesystem server, the IO
path depends on O_DIRECT flag by the application. This can be passed
at the time of open() as well as via fcntl().
Note that asynchronous O_DIRECT iocb jobs are completed synchronously
always (this has been the case with FUSE even before this patch)
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Multiplexing filesystems may want to support ioctls on the underlying
files and directores (e.g. FS_IOC_{GET,SET}FLAGS).
Ioctl support on directories was missing so add it now.
Reported-by: Antonio SJ Musumeci <bile@landofbile.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use generic_file_llseek() instead of open coding the seek function.
i_mutex protection is only necessary for SEEK_END (and SEEK_HOLE, SEEK_DATA), so
move SEEK_CUR and SEEK_SET out from under i_mutex.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fix race between lseek(fd, 0, SEEK_CUR) and read/write. This was fixed in
generic code by commit 5b6f1eb97d (vfs: lseek(fd, 0, SEEK_CUR) race condition).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The test in fuse_file_llseek() "not SEEK_CUR or not SEEK_SET" always evaluates
to true.
This was introduced in 3.1 by commit 06222e49 (fs: handle SEEK_HOLE/SEEK_DATA
properly in all fs's that define their own llseek) and changed the behavior of
SEEK_CUR and SEEK_SET to always retrieve the file attributes. This is a
performance regression.
Fix the test so that it makes sense.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
CC: Josef Bacik <josef@redhat.com>
CC: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message
fuse: mark pages accessed when written to
fuse: delete dead .write_begin and .write_end aops
fuse: fix flock
fuse: fix non-ANSI void function notation
As fuse does not use the page cache library functions when userspace
writes to a file, it did not benefit from 'c8236db mm: mark page
accessed before we write_end()' that made sure pages are properly
marked accessed when written to.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Ever since 'ea9b990 fuse: implement perform_write', the .write_begin
and .write_end aops have been dead code.
Their task - acquiring a page from the page cache, sending out a write
request and releasing the page again - is now done batch-wise to
maximize the number of pages send per userspace request.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a
number of issues with supporing flock locks over existing POSIX
locking infrastructure:
- it's not backward compatible, passing flock(2) calls to userspace
unconditionally (if userspace sets FUSE_POSIX_LOCKS)
- it doesn't cater for the fact that flock locks are automatically
unlocked on file release
- it doesn't take into account the fact that flock exclusive locks
(write locks) don't need an fd opened for write.
The last one invalidates the original premise of the patch that flock
locks can be emulated with POSIX locks.
This patch fixes the first two issues. The last one needs to be fixed
in userspace if the filesystem assumed that a write lock will happen
only on a file operned for write (as in the case of the current fuse
library).
Reported-by: Sebastian Pipping <webmaster@hartwork.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-3.1' of git://linux-nfs.org/~bfields/linux:
nfsd: don't break lease on CLAIM_DELEGATE_CUR
locks: rename lock-manager ops
nfsd4: update nfsv4.1 implementation notes
nfsd: turn on reply cache for NFSv4
nfsd4: call nfsd4_release_compoundargs from pc_release
nfsd41: Deny new lock before RECLAIM_COMPLETE done
fs: locks: remove init_once
nfsd41: check the size of request
nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
nfsd4: fix file leak on open_downgrade
nfsd4: remember to put RW access on stateid destruction
NFSD: Added TEST_STATEID operation
NFSD: added FREE_STATEID operation
svcrpc: fix list-corrupting race on nfsd shutdown
rpc: allow autoloading of gss mechanisms
svcauth_unix.c: quiet sparse noise
svcsock.c: include sunrpc.h to quiet sparse noise
nfsd: Remove deprecated nfsctl system call and related code.
NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done. For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a fuse dev connection is broken, wake up any
processes that are blocking, in a poll system call,
on one of the files in the now defunct filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
triggered a DESTROY request via path_put().
Fix this by
a) making RELEASE requests synchronous, whenever possible, on fuseblk
filesystems
b) if not possible (triggered by an asynchronous read/write) then do
the path_put() in a separate thread with schedule_work().
Reported-by: Oliver Neukum <oneukum@suse.de>
Cc: stable@kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In kernel ABI version 7.16 and later FUSE_IOCTL_RETRY reply from a
unrestricted IOCTL request shall return with an array of 'struct
fuse_ioctl_iovec' instead of 'struct iovec'. This fixes the ABI
ambiguity of 32bit vs. 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.
The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.
Work around this by looking at the size of the reply to determine
which struct was used. This is only needed if CONFIG_COMPAT is
defined.
A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.
If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.
Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.
Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse doesn't understand lock annotations of the form
__releases(&foo->lock). Change them to __releases(foo->lock). Same
for __acquires().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Userspace filesystem can request data to be stored in the inode's
mapping. This request is synchronous and has no reply. If the write
to the fuse device returns an error then the store request was not
fully completed (but may have updated some pages).
If the stored data overflows the current file size, then the size is
extended, similarly to a write(2) on the filesystem.
Pages which have been completely stored are marked uptodate.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
mm: export generic_pipe_buf_*() to modules
fuse: support splice() reading from fuse device
fuse: allow splice to move pages
mm: export remove_from_page_cache() to modules
mm: export lru_cache_add_*() to modules
fuse: support splice() writing to fuse device
fuse: get page reference for readpages
fuse: use get_user_pages_fast()
fuse: remove unneeded variable
When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
move pages from the pipe buffer into the page cache. This allows
populating the fuse filesystem's cache without ever touching the page
contents, i.e. zero copy read capability.
The following steps are performed when trying to move a page into the
page cache:
- buf->ops->confirm() to make sure the new page is uptodate
- buf->ops->steal() to try to remove the new page from it's previous place
- remove_from_page_cache() on the old page
- add_to_page_cache_locked() on the new page
If any of the above steps fail (non fatally) then the code falls back
to copying the page. In particular ->steal() will fail if there are
external references (other than the page cache and the pipe buffer) to
the page.
Also since the remove_from_page_cache() + add_to_page_cache_locked()
are non-atomic it is possible that the page cache is repopulated in
between the two and add_to_page_cache_locked() will fail. This could
be fixed by creating a new atomic replace_page_cache_page() function.
fuse_readpages_end() needed to be reworked so it works even if
page->mapping is NULL for some or all pages which can happen if the
add_to_page_cache_locked() failed.
A number of sanity checks were added to make sure the stolen pages
don't have weird flags set, etc... These could be moved into generic
splice/steal code.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acquire a page ref on pages in ->readpages() and release them when the
read has finished. Not acquiring a reference didn't seem to cause any
trouble since the page is locked and will not be kicked out of the
page cache during the read.
However the following patches will want to remove the page from the
cache so a separate ref is needed. Making the reference in req->pages
explicit also makes the code easier to understand.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Replace uses of get_user_pages() with get_user_pages_fast(). It looks
nicer and should be faster in most cases.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
"map" isn't needed any more after: 0bd87182d3 "fuse: fix kunmap in
fuse_ioctl_copy_user"
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic. So the right steps should be:
flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);
More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.
Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:
int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);
The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looks like another victim of the confusing kmap() vs kmap_atomic() API
differences.
Reported-by: Todor Gyumyushev <yodor1@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
fuse_direct_io() has a loop where requests are allocated in each
iteration. if allocation fails, the loop is broken out and follows
into an unconditional fuse_put_request() on that invalid pointer.
Signed-off-by: Anand V. Avati <avati@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use ff->fc and ff->nodeid instead of file->f_dentry->d_inode in the
fuse_file_poll() implementation.
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Create a helper for sending an IOCTL request that doesn't use a struct
inode.
This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Make fuse_sync_release() a generic helper function that doesn't need a
struct inode pointer. This makes it suitable for use by CUSE.
Change return value of fuse_release_common() from int to void.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>