Commit Graph

52613 Commits

Author SHA1 Message Date
Yan, Zheng
8d8f371c83 ceph: cleanup traceless reply handling for rename
ceph_fill_trace() already calls ceph_invalidate_dir_request() for
traceless reply. No need to duplicate the code in ceph_rename().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:06 +01:00
Yan, Zheng
87c91a965a ceph: voluntarily drop Fx cap for readdir request
MDS needs to rdlock directory inode's filelock when handling readdir
request. Voluntarily dropping CEPH_CAP_FILE_EXCL avoids a cap revoke
message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:05 +01:00
Yan, Zheng
be70489eff ceph: properly drop caps for setattr request
For CEPH_SETATTR_ATIME, MDS needs to xlock filelock, Fsxrw caps
are not allowed for xlocked filelock.

For CEPH_SETATTR_SIZE request that truncates file to smaller size,
MDS needs to xlock filelock, Fsxrw caps are not allowed for xlocked
filelock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:05 +01:00
Yan, Zheng
d19a0b5401 ceph: voluntarily drop Lx cap for link/rename requests
MDS needs to xlock inode's linklock when handling link/rename requests.
Voluntarily dropping CEPH_CAP_LINK_EXCL avoids a cap revoke message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:04 +01:00
Yan, Zheng
222b7f90ba ceph: voluntarily drop Ax cap for requests that create new inode
MDS needs to rdlock directory inode's authlock when handling these
requests. Voluntarily dropping CEPH_CAP_AUTH_EXCL avoids a cap revoke
message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:04 +01:00
Bob Peterson
2eb5909dee GFS2: Don't try to end a non-existent transaction in unlink
Before this patch, if function gfs2_unlink failed to get a valid
transaction (for example, not enough journal blocks) it would go
to label out_end_trans which did gfs2_trans_end. But if the
trans_begin failed, there's no transaction to end, and trying to
do so results in: kernel BUG at fs/gfs2/trans.c:117!

This patch changes the goto so that it does not try to end a
non-existent transaction.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-29 10:00:23 -07:00
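
The error-path rule behind the gfs2_unlink fix above can be illustrated with a small user-space sketch. The names txn_begin, txn_end and do_unlink are hypothetical stand-ins, not the GFS2 functions; the point is only that a failed begin must jump past the label that ends the transaction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a journal transaction; not the GFS2 type. */
struct txn {
    bool active;    /* true between a successful begin and its end */
    int end_calls;  /* how many times txn_end() ran */
};

/* Fails (returns -1) without starting anything when space is short. */
static int txn_begin(struct txn *t, bool enough_space)
{
    if (!enough_space)
        return -1;
    t->active = true;
    return 0;
}

static void txn_end(struct txn *t)
{
    /* Ending a transaction that never began is the bug being avoided. */
    assert(t->active);
    t->active = false;
    t->end_calls++;
}

/* Returns 0 on success, a negative value on failure. */
static int do_unlink(struct txn *t, bool enough_space, bool work_ok)
{
    int error;

    error = txn_begin(t, enough_space);
    if (error)
        goto out;            /* no transaction exists, so skip txn_end() */

    if (!work_ok) {
        error = -2;
        goto out_end_trans;  /* a transaction exists and must be ended */
    }

out_end_trans:
    txn_end(t);
out:
    return error;
}
```

Each label undoes only what was set up before the jump, so the cleanup order mirrors the setup order.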
Christoph Hellwig
1e369b0e19 xfs: remove experimental tag for reflinks
But reject reflink + DAX file systems for now until the code to
support reflinks on DAX is actually implemented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: port to 4.16]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
6d8a45ce29 xfs: don't screw up direct writes when freesp is fragmented
xfs_bmap_btalloc is given a range of file offset blocks that must be
allocated to some data/attr/cow fork.  If the fork has an extent size
hint associated with it, the request will be enlarged on both ends to
try to satisfy the alignment hint.  If free space is fragmented,
sometimes we can allocate some blocks but not enough to fulfill any of
the requested range.  Since bmapi_allocate always trims the new extent
mapping to match the originally requested range, this results in
bmapi_write returning zero and no mapping.

The consequences of this vary -- buffered writes will simply re-call
bmapi_write until it can satisfy at least one block from the original
request.  Direct IO overwrites notice nmaps == 0 and return -ENOSPC
through the dio mechanism out to userspace with the weird result that
writes fail even when we have enough space because the ENOSPC return
overrides any partial write status.  For direct CoW writes the situation
was disastrous because nobody notices us returning an invalid zero-length
wrong-offset mapping to iomap and the write goes off into space.

Therefore, if free space is so fragmented that we managed to allocate
some space but not enough to map into even a single block of the
original allocation request range, we should break the alignment hint in
order to guarantee at least some forward progress for the direct write.
If we return a short allocation to iomap_apply it'll call back about the
remaining blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
9f37bd11b4 xfs: check reflink allocation mappings
There's a really bad bug in xfs_reflink_allocate_cow -- bmapi_write
can return a zero error code but no mappings.  This happens if there's
an extent size hint (which causes allocation requests to be rounded to
extsz granularity internally), but there wasn't a big enough chunk of
free space to start filling at the extsz granularity and fill even one
block of the range that we actually requested.

In any case, if we got no mappings we can't possibly do anything useful
with the contents of imap, so we must bail out with ENOSPC here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:24 -08:00
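
The check described in the commit above, a zero return code with zero mappings being treated as ENOSPC, might look like the following user-space sketch. The struct and the functions alloc_range and allocate_cow are hypothetical models, not the XFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mapping record; not the XFS bmap structure. */
struct mapping {
    unsigned long long start;
    unsigned long long len;
};

/* Model allocator: reports success even when it mapped nothing. */
static int alloc_range(unsigned long long free_blocks,
                       struct mapping *map, int *nmaps)
{
    if (free_blocks == 0) {
        *nmaps = 0;         /* nothing usable, yet no error code */
        return 0;
    }
    map->start = 0;
    map->len = free_blocks;
    *nmaps = 1;
    return 0;
}

static int allocate_cow(unsigned long long free_blocks, struct mapping *map)
{
    int nmaps = 0;
    int error;

    assert(map != NULL);
    error = alloc_range(free_blocks, map, &nmaps);
    if (error)
        return error;
    if (nmaps == 0)
        return -ENOSPC;     /* map holds nothing useful, so bail out */
    return 0;
}
```

The caller never looks at the mapping unless at least one mapping was returned.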
Darrick J. Wong
0c6dda7a1c iomap: warn on zero-length mappings
Don't let the iomap callback get away with feeding us a garbage zero
length mapping -- there was a bug in xfs that resulted in those leaking
out to hilarious effect.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
4b4c1326fd xfs: treat CoW fork operations as delalloc for quota accounting
Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork.  Unlike the data
fork, even real extents in the CoW fork are only delalloc-style
reservations (on-disk they're owned by the refcountbt) so they must not
be tracked in the on disk quota info.  Ensure the i_delayed_blks
accounting reflects this too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
01c2e13dca xfs: only grab shared inode locks for source file during reflink
Reflink and dedupe operations remap blocks from a source file into a
destination file.  The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
7c2d238ac6 xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
Refactor xfs_lock_two_inodes to take separate locking modes for each
inode.  Specifically, this enables us to take a SHARED lock on one inode
and an EXCL lock on the other.  The lock class (MMAPLOCK/ILOCK) must be
the same for each inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
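
A toy model of locking two inodes with separate modes, as the refactoring above describes, is sketched below. The types and the lock_two function are hypothetical, the locks only count holders instead of blocking, and taking the lower inode number first is an assumption made here to keep a consistent lock order.

```c
#include <assert.h>
#include <stddef.h>

enum lock_mode { LOCK_SHARED, LOCK_EXCL };

/* Hypothetical toy inode: records holders rather than blocking. */
struct toy_inode {
    unsigned long ino;
    int shared;   /* number of shared holders */
    int excl;     /* 1 if held exclusively */
};

static void toy_lock(struct toy_inode *ip, enum lock_mode mode)
{
    if (mode == LOCK_EXCL) {
        assert(ip->shared == 0 && ip->excl == 0);
        ip->excl = 1;
    } else {
        assert(ip->excl == 0);
        ip->shared++;
    }
}

/* Lock both inodes, lower inode number first, each in its own mode. */
static void lock_two(struct toy_inode *ip0, enum lock_mode m0,
                     struct toy_inode *ip1, enum lock_mode m1)
{
    assert(ip0 != NULL && ip1 != NULL && ip0->ino != ip1->ino);

    if (ip0->ino > ip1->ino) {
        /* Swap the inodes together with their modes. */
        struct toy_inode *tip = ip0;
        enum lock_mode tm = m0;

        ip0 = ip1;
        m0 = m1;
        ip1 = tip;
        m1 = tm;
    }
    toy_lock(ip0, m0);
    toy_lock(ip1, m1);
}
```

For a reflink-style operation the source would be passed with LOCK_SHARED and the destination with LOCK_EXCL; the mode travels with its inode when the pair is reordered.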
Darrick J. Wong
1364b1d4b5 xfs: reflink should break pnfs leases before sharing blocks
Before we share blocks between files, we need to break the pnfs leases
on the layout before we start slicing and dicing the block map.  The
structure of this function sets us up for the lock contention reduction
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
c47b74fb2d xfs: don't clobber inobt/finobt cursors when xref with rmap
Even if we can't use the inobt/finobt cursors to count the number of
inode btree blocks, we are never allowed to clobber the cursor of the
btree being checked, so don't do this.  Found by fuzzing level = ones
in xfs/364.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
70c57dcd60 xfs: skip CoW writes past EOF when writeback races with truncate
Every so often we blow the ASSERT(type != XFS_IO_COW) in xfs_map_blocks
when running fsstress, as we do in generic/269.  The cause of this is
writeback racing with truncate -- writeback doesn't take the iolock, so
truncate can sneak in to decrease i_size and truncate page cache while
writeback is gathering buffer heads to schedule writeout.

If we hit this race on a block that has a CoW mapping, we'll get a valid
imap from the CoW fork but the reduced i_size trims the mapping to zero
length (which makes it invalid), so we call xfs_map_blocks to try again.
This doesn't do much anyway, since any mapping we get out of that will
also be invalid, so we might as well skip the assert and just stop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Amir Goldstein
acd1d71598 xfs: preserve i_rdev when recycling a reclaimable inode
Commit 66f364649d ("xfs: remove if_rdev") moved storing of rdev
value for special inodes to VFS inodes, but forgot to preserve the
value of i_rdev when recycling a reclaimable xfs_inode.

This was detected by xfstest overlay/017 with index=on mount option
and xfs base fs. The test does a lookup of overlay chardev and blockdev
right after drop caches.

Overlayfs inodes hold a reference on underlying xfs inodes when mount
option index=on is configured. If drop caches reclaims xfs inodes before
it reclaims overlayfs inodes, that can sometimes leave a reclaimable xfs
inode and that test hits that case quite often.

When that happens, the xfs inode cache remains broken (zero i_rdev)
until the next cycle mount or drop caches.

Fixes: 66f364649d ("xfs: remove if_rdev")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
751f3767c2 xfs: refactor accounting updates out of xfs_bmap_btalloc
Move all the inode and quota accounting updates out of xfs_bmap_btalloc
in preparation for fixing some quota accounting problems with copy on
write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
22431bf3df xfs: refactor inode verifier corruption error printing
Refactor inode verifier error reporting into a non-libxfs function so
that we aren't encoding the message format in libxfs.  This also
changes the kernel dmesg output to resemble buffer verifier errors
more closely.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
67a3f6d014 xfs: make tracepoint inode number format consistent
Fix all the inode number formats to be consistently (0x%llx) in all
trace point definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
beaae8cd58 xfs: always zero di_flags2 when we free the inode
Always zero the di_flags2 field when we free the inode so that we never
end up with an on-disk record for an unallocated inode that also has the
reflink iflag set.  This is in keeping with the general principle that
only files can have the reflink iflag set, even though we'll zero out
di_flags2 if we ever reallocate the inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
09ac862397 xfs: call xfs_qm_dqattach before performing reflink operations
Ensure that we've attached all the necessary dquots before performing
reflink operations so that quota accounting is accurate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Shan Hai
6ca30729c2 xfs: bmap code cleanup
Remove the extent size hint and realtime inode relevant code from
the xfs_bmapi_reserve_delalloc since it is not called on the inode
with extent size hint set or on a realtime inode.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
643c8c05e7 Use list_head infra-structure for buffer's log items list
Now that buffer's b_fspriv has been split, just replace the current
singly linked list of xfs_log_items with the list_head infrastructure.

Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers();
there is no need for this argument now that the log items can be walked
through the list_head in the buffer.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style cleanups]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
fb1755a645 Split buffer's b_fspriv field
By splitting the b_fspriv field into two different fields (b_log_item
and b_li_list), it's possible to get rid of an old ABI workaround, by
using the new b_log_item field to store xfs_buf_log_item separated from
the log items attached to the buffer, which will be linked in the new
b_li_list field.

This way, there is no more need to reorder the log items list to place
the buf_log_item at the beginning of the list, simplifying a bit the
logic to handle buffer IO.

This also opens the possibility to change buffer's log items list into a
proper list_head.

b_log_item field is still defined as a void *, because it is still used
by the log buffers to store xlog_in_core structures, and there is no
need to add an extra field on xfs_buf just for xlog_in_core.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style changes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
70a2065533 Get rid of xfs_buf_log_item_t typedef
Take advantage of the rework on xfs_buf log items list, to get rid of
the typedef for xfs_buf_log_item.

This patch also fixes some indentation alignment issues found along the way.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
David S. Miller
3e3ab9ccca Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 10:15:51 -05:00
Jeff Layton
3a8c7231d5 btrfs: only dirty the inode in btrfs_update_time if something was changed
At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
d17260fd5f xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
If XFS_ILOG_CORE is already set then go ahead and increment it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
e38cf302b2 fs: only set S_VERSION when updating times if necessary
We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:42:21 -05:00
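
The idea in the commit above, bumping the counter only if it has been queried since the last increment or if the caller forces it, can be modelled in user space as follows. This is a sketch under assumptions: the names are hypothetical, the low bit is used here as the "queried" flag, and the code is not atomic, so it is not a description of the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the low bit records "queried since last bump". */
#define VERSION_QUERIED 1ULL

struct version_counter {
    uint64_t raw;   /* (value << 1) | queried flag */
};

/* Return the current value and remember that it was observed. */
static uint64_t version_query(struct version_counter *v)
{
    assert(v != NULL);
    v->raw |= VERSION_QUERIED;
    return v->raw >> 1;
}

/* Bump only if forced or queried; returns true if the value changed. */
static bool version_maybe_inc(struct version_counter *v, bool force)
{
    assert(v != NULL);
    if (!force && !(v->raw & VERSION_QUERIED))
        return false;                    /* nobody looked, skip the update */
    v->raw = ((v->raw >> 1) + 1) << 1;   /* new value, flag cleared */
    return true;
}
```

A caller that updates timestamps would pass force as true when the times changed, and false otherwise, and would dirty the inode only when the function returns true.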
Jeff Layton
f0e2828062 xfs: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
bb8c2d66bc ufs: use new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
cc56c33e78 ocfs2: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:42:21 -05:00
Jeff Layton
1f15a550f5 nfsd: convert to new i_version API
Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
1eb5d98f16 nfs: convert to new i_version API
For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
ee73f9a52a ext4: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
2018-01-29 06:42:21 -05:00
Jeff Layton
e1d747d9b6 ext2: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:42:20 -05:00
Jeff Layton
317bc94780 exofs: switch to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:20 -05:00
Jeff Layton
c7f88c4e78 btrfs: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
2018-01-29 06:42:20 -05:00
Jeff Layton
a01179e6eb afs: convert to new i_version API
For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data in regular files
and contents of the directories. Inode metadata changes do not result
in a version increment.

We'll need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:20 -05:00
Jeff Layton
9dffe569d9 affs: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:20 -05:00
Jeff Layton
2489dbabea fat: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:20 -05:00
Jeff Layton
ae5e165d85 fs: new API for handling inode->i_version
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:41:30 -05:00
Trond Myklebust
e231c6879c NFS: Fix a race between mmap() and O_DIRECT
When locking the file in order to do O_DIRECT on it, we must unmap
any mmapped ranges on the pagecache so that we can flush out the
dirty data.

Fixes: a5864c999d ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2018-01-28 22:00:15 -05:00
Achilles Gaikwad
36c7ce4a17 fs/cifs/cifsacl.c Fixes typo in a comment
Signed-off-by: Achilles Gaikwad <achillesgaikwad@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-28 09:19:45 -06:00
Trond Myklebust
128159f292 NFS: Remove a redundant call to unmap_mapping_range()
We don't need to call unmap_mapping_range() prior to calling
nfs_sync_mapping().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-28 09:35:54 -05:00
Steve French
ab2c643309 update internal version number for cifs.ko
To version 2.11

Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-26 17:03:01 -06:00
Andrés Souto
cd1aca29fa cifs: add .splice_write
add splice_write support in cifs vfs using iter_file_splice_write

Signed-off-by: Andrés Souto <kai670@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-26 17:03:01 -06:00
Aurelien Aptel
4a1360d01d CIFS: document tcon/ses/server refcount dance
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-26 17:03:00 -06:00
Steve French
6b314714ff move a few externs to smbdirect.h to eliminate warning
Quiet minor sparse warnings in new SMB3 rdma patch series
("symbol was not declared ...") by moving these externs to smbdirect.h

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-26 17:03:00 -06:00
Aurelien Aptel
97f4b7276b CIFS: zero sensitive data when freeing
also replaces memset()+kfree() by kzfree().

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: <stable@vger.kernel.org>
2018-01-26 17:03:00 -06:00
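
A user-space analogue of zeroing a buffer before freeing it, the pattern the commit above applies to sensitive CIFS data, is sketched below. The helpers secure_zero and zero_and_free are hypothetical; writing through a volatile pointer is one common way to keep the compiler from dropping the stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Overwrite len bytes with zeros through a volatile pointer. */
static void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    while (len--)
        *p++ = 0;
}

/* Zero the buffer, then release it; a NULL buffer is ignored. */
static void zero_and_free(void *buf, size_t len)
{
    if (buf == NULL)
        return;
    secure_zero(buf, len);
    free(buf);
}
```

Folding both steps into one helper means a caller cannot free the buffer and forget the wipe.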
Steve French
2026b06e9c Cleanup some minor endian issues in smb3 rdma
Minor cleanup of some sparse warnings (including a few misc
endian fixes for the new smb3 rdma code)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-26 17:03:00 -06:00
Aurelien Aptel
02cf5905e3 CIFS: dump IPC tcon in debug proc file
dump it as first share with an "IPC: " prefix.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-26 17:03:00 -06:00
Aurelien Aptel
63a83b861c CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
Since IPC now has a tcon object, the caller can just pass it. This
allows domain-based DFS requests to work with smb2+.

Link: https://bugzilla.samba.org/show_bug.cgi?id=12917
Fixes: 9d49640a21 ("CIFS: implement get_dfs_refer for SMB2+")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-26 17:03:00 -06:00
Aurelien Aptel
b327a717e5 CIFS: make IPC a regular tcon
* Remove ses->ipc_tid.
* Make IPC$ regular tcon.
* Add a direct pointer to it in ses->tcon_ipc.
* Distinguish PIPE tcon from IPC tcon by adding a tcon->pipe flag. All
  IPC tcons are pipes but not all pipes are IPC.
* All TreeConnect functions now cannot take a NULL tcon object.

The IPC tcon has the same lifetime as the session it belongs to. It is
created when the session is created and destroyed when the session is
destroyed.

Since no mounts directly refer to the IPC tcon, its refcount should
always be set to initialisation value (1). Thus we make sure
cifs_put_tcon() skips it.

If the mount request resulting in a new session being created requires
encryption, try to require it too for IPC.

* set SERVER_NAME_LENGTH to serverName actual size

The maximum length of an ipv6 string representation is defined in
INET6_ADDRSTRLEN as 45+1 for null but let's keep what we know works.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-26 17:03:00 -06:00
Martin Brandenburg
6793f1c450 orangefs: fix deadlock; do not write i_size in read_iter
After do_readv_writev, the inode cache is invalidated anyway, so i_size
will never be read.  It will be fetched from the server which will also
know about updates from other machines.

Fixes deadlock on 32-bit SMP.

See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-25 17:26:24 -08:00
Jake Daryll Obina
5bdd0c6f89 jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode()
can get called twice in the error handling path, the first call in
jffs2_iget() itself and the second through iget_failed(). This can result
in a use-after-free error in the second jffs2_do_clear_inode() call, such
as shown by the oops below wherein the second jffs2_do_clear_inode() call
was trying to free node fragments that were already freed in the first
jffs2_do_clear_inode() call.

[   78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c
[   78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
[   78.185871] pgd = ffffffc03a567000
[   78.188794] [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
[   78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP
...
[   78.513147] PC is at rb_first_postorder+0xc/0x28
[   78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2]
[   78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105
[   78.526757] sp : ffffff800cea38f0
[   78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80
[   78.532754] x27: 0000000000000000 x26: ffffff800cea3c70
[   78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d
[   78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8
[   78.544760] x21: 0000000000000000 x20: 0000000000000000
[   78.548762] x19: ffffffc037d48910 x18: ffffff800891a588
[   78.552764] x17: 0000000000000800 x16: 0000000000000c00
[   78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f
[   78.560767] x13: 6461657220726f66 x12: 2064656c69616620
[   78.564769] x11: 435243203a6c616e x10: 7265746e695f6564
[   78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038
[   78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
[   78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000
[   78.580777] x3 : 0000000000000000 x2 : 000000403755e000
[   78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b
...
[   79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28
[   79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2]
[   79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2]
[   79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168
[   79.056835] [<ffffff8008202650>] iput+0x1c0/0x200
[   79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c
[   79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2]
[   79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2]
[   79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190
[   79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c
[   79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108
[   79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100
[   79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c
[   79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114
[   79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98
[   79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4
[   79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821)

The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since
iget_failed() will eventually call jffs2_do_clear_inode() if needed, so
just remove it.

Fixes: 5451f79f5f ("iget: stop JFFS2 from using iget() and read_inode()")
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Jake Daryll Obina <jake.obina@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:30 -05:00
Alexey Dobriyan
b35d786b67 dcache: delete unused d_hash_mask
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:30 -05:00
Alexey Dobriyan
854d3e6343 dcache: subtract d_hash_shift from 32 in advance
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:29 -05:00
Eric Biggers
01950a349e fs/buffer.c: fold init_buffer() into init_page_buffers()
Since commit e76004093d ("fs/buffer.c: remove unnecessary init
operation after allocating buffer_head"), there are no callers of
init_buffer() outside of init_page_buffers().  So just fold it into
init_page_buffers().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:28 -05:00
Eric Biggers
4bfd054ae1 fs: fold __inode_permission() into inode_permission()
Since commit 9c630ebefe ("ovl: simplify permission checking"),
overlayfs doesn't call __inode_permission() anymore, which leaves no
users other than inode_permission().  So just fold it back into
inode_permission().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:28 -05:00
Chao Yu
1c1d35df71 f2fs: support inode creation time
This patch adds creation time field in inode layout to support showing
kstat.btime in ->statx.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 14:10:39 -08:00
Benjamin Coddington
f34462c3c8 pnfs/blocklayout: Ensure disk address in block device map
It's possible that the device map is smaller than the offset into the device
for the I/O we're adding.  Add a check for it and bail out, otherwise we
risk botching the bio calculations that follow.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
2018-01-25 16:42:35 -05:00
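
The kind of range check the commit above adds can be sketched as follows. The struct layout and the functions map_contains and add_io are hypothetical and only model "bail out when the I/O offset falls outside the device map"; they are not the pnfs blocklayout code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device map covering [start, start + len), in bytes. */
struct dev_map {
    uint64_t start;
    uint64_t len;
};

/* True if offset lies inside the map; written to avoid overflow. */
static bool map_contains(const struct dev_map *map, uint64_t offset)
{
    assert(map != NULL);
    return offset >= map->start && offset - map->start < map->len;
}

/* Refuse the I/O instead of computing with an out-of-range address. */
static int add_io(const struct dev_map *map, uint64_t offset)
{
    if (!map_contains(map, offset))
        return -EIO;
    return 0;
}
```

Comparing offset - start against len, rather than offset against start + len, keeps the check correct even when start + len would wrap around.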
Benjamin Coddington
b39604755c pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
Fixup the field types to match their use.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
2018-01-25 16:42:35 -05:00
Yunlei He
068c3cd858 f2fs: rebuild sit page from sit info in mem
This patch rebuilds the sit page from sit info in memory instead
of issuing a read io.

I tested this method and the result is as below:

Pre:
 mmc_perf_test-12061 [001] ...1   976.819992: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [001] ...1   976.856446: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1   998.976946: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1   999.023269: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1022.060772: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1022.111034: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [002] ...1  1070.127643: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1070.187352: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1095.942124: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1095.995975: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1122.535091: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [003] ...1  1122.586521: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [001] ...1  1147.897487: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [001] ...1  1147.959438: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [003] ...1  1177.926951: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [002] ...1  1177.976823: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
 mmc_perf_test-12061 [002] ...1  1204.176087: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
 mmc_perf_test-12061 [002] ...1  1204.239046: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit

Some SIT flushes take more than 50 ms.

Now:
mmc_perf_test-2187  [007] ...1   196.840684: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [007] ...1   196.841258: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [007] ...1   219.430582: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [007] ...1   219.431144: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [002] ...1   243.638678: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   243.638980: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [002] ...1   265.392180: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [002] ...1   265.392245: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [000] ...1   290.309051: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   290.309116: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [003] ...1   317.144209: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [003] ...1   317.145913: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [005] ...1   343.224954: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [005] ...1   343.225574: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [000] ...1   370.239846: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [000] ...1   370.241138: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [001] ...1   397.029043: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [001] ...1   397.030750: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit
mmc_perf_test-2187  [003] ...1   425.386377: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit
mmc_perf_test-2187  [003] ...1   425.387735: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit

Most SIT flushes take no more than 1 ms.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:44:34 -08:00
Chao Yu
3b60d802d9 f2fs: stop issuing discard if fs is readonly
If the filesystem is read-only, stop issuing discards in the daemon.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:40:10 -08:00
Chao Yu
6819b884e0 f2fs: clean up duplicated assignment in init_discard_policy
Remove the duplicated assignments of .max_requests and .io_aware_gran.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:40:01 -08:00
Chao Yu
2882d34310 f2fs: use GFP_F2FS_ZERO for cleanup
Clean up the code by using GFP_F2FS_ZERO; no logic changes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-25 10:39:49 -08:00
Bob Peterson
4519eaad72 GFS2: Fix minor comment typo
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-25 10:18:06 -07:00
Linus Torvalds
525273fb2e for-4.15

Merge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "It's been reported recently that readdir can list stale entries under
  some conditions. Fix it."

* tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix stale entries in readdir
2018-01-25 09:03:10 -08:00
Colin Ian King
37e12f5551 cifs: remove redundant duplicated assignment of pointer 'node'
Node is assigned twice to rb_first(root), first at declaration
time and again after taking a spin lock, so we have a duplicated
assignment.  Remove the first assignment because it is redundant and
also not protected by the spin lock.

Cleans up clang warning:
fs/cifs/connect.c:4435:18: warning: Value stored to 'node' during
its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Arnd Bergmann
e36c048a9b CIFS: SMBD: work around gcc -Wmaybe-uninitialized warning
GCC versions from 4.9 to 6.3 produce a false-positive warning when
dealing with a conditional spin_lock_irqsave():

fs/cifs/smbdirect.c: In function 'smbd_recv_buf':
include/linux/spinlock.h:260:3: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized]

This function calls some sleeping interfaces, so it is clear that it
does not get called with interrupts disabled and there is no need
to save the irq state before taking the spinlock. This lets us
remove the variable, which makes the function slightly more efficient
and avoids the warning.

A further cleanup could do the same change for other functions in this
file, but I did not want to take this too far for now.

Fixes: ac69f66e54ca ("CIFS: SMBD: Implement function to receive data via RDMA receive")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:07 -06:00
Daniel N Pettersson
9aca7e4544 cifs: Fix autonegotiate security settings mismatch
Autonegotiation gives a security settings mismatch error if the SMB
server selects an SMBv3 dialect that isn't SMB3.02. The exact error is
"protocol revalidation - security settings mismatch".
This can be tested using Samba v4.2 or by setting the global Samba
setting max protocol = SMB3_00.

The check that fails in smb3_validate_negotiate is the dialect
verification of the negotiate info response. This is because it tries
to verify against the protocol_id in the global smbdefault_values. The
protocol_id in smbdefault_values is SMB3.02.
SMB2_negotiate does not update the protocol_id in smbdefault_values (it is
global, so it probably shouldn't), but it does update server->dialect.

This patch changes the check in smb3_validate_negotiate to use
server->dialect instead of server->vals->protocol_id. The patch works
with autonegotiate and when using a specific version in the vers mount
option.
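
The difference between the two checks can be illustrated with a small
user-space sketch. The struct layouts and the two helper functions below are
hypothetical stand-ins, not the kernel's definitions; only the field names
dialect, vals and protocol_id come from the message above. The dialect codes
0x0300 and 0x0302 are the SMB 3.0 and SMB 3.02 revision numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* SMB2 dialect revision codes for SMB 3.0 and SMB 3.02. */
#define DEMO_SMB30_DIALECT  0x0300
#define DEMO_SMB302_DIALECT 0x0302

/* Hypothetical stand-ins for the structures named in the message. */
struct demo_version_values {
    uint16_t protocol_id;                   /* shared, global default */
};

struct demo_server {
    const struct demo_version_values *vals; /* points at the global defaults */
    uint16_t dialect;                       /* set per connection at negotiate */
};

/* Old check: compares against the global default, which stays at 3.02. */
bool demo_validate_old(const struct demo_server *s, uint16_t rsp_dialect)
{
    assert(s != NULL && s->vals != NULL);
    return rsp_dialect == s->vals->protocol_id;
}

/* New check: compares against what this connection actually negotiated. */
bool demo_validate_new(const struct demo_server *s, uint16_t rsp_dialect)
{
    assert(s != NULL);
    return rsp_dialect == s->dialect;
}
```

With a server that negotiated SMB 3.0 while the defaults still say 3.02, the
old check rejects a correct response and the new check accepts it.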

Signed-off-by: Daniel N Pettersson <danielnp@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2018-01-24 19:49:07 -06:00
kbuild test robot
9084432c31 CIFS: SMBD: _smbd_get_connection() can be static
Fixes: 07495ff5d9bc ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Long Li <longli@microsoft.com>
2018-01-24 19:49:07 -06:00
Long Li
8801e90233 CIFS: SMBD: Disable signing on SMB direct transport
Currently the CIFS SMB Direct implementation (experimental) doesn't properly
support signing. Disable it when SMB Direct is in use for transport.

Signing will be enabled in the future, once it is implemented.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Long Li
08a3b9690f CIFS: SMBD: Add SMB Direct debug counters
For debugging and troubleshooting, export SMBDirect debug counters to
/proc/fs/cifs/DebugData.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Long Li
bd3dcc6a22 CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration
If I/O size is larger than rdma_readwrite_threshold, use RDMA write for
SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or
SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on SMB dialect
used. Append a smbd_buffer_descriptor_v1 to the end of the SMB packet and fill
in other values to indicate this SMB read uses RDMA write.
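
The size test described above can be sketched as follows. This is only an
illustration: the enum and the function name are hypothetical, and the choice
between the two channel values is left out because the message does not say
which dialect maps to which.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transfer modes; the names are illustrative only. */
enum demo_xfer_mode {
    DEMO_XFER_SEND_RECV,   /* payload travels inline via RDMA send/receive */
    DEMO_XFER_REGISTERED   /* payload is placed directly via registered memory */
};

/*
 * Pick the transfer mode for one I/O. Only sizes strictly larger than the
 * threshold use memory registration, matching "larger than" in the text.
 */
enum demo_xfer_mode demo_choose_mode(size_t io_size, size_t threshold)
{
    assert(threshold > 0);
    return io_size > threshold ? DEMO_XFER_REGISTERED : DEMO_XFER_SEND_RECV;
}
```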

There is no need to read from the transport for incoming payload. At the time
SMB read response comes back, the data is already transferred and placed in the
pages by RDMA hardware.

When SMB read is finished, deregister the memory regions if RDMA write is used
for this SMB read. smbd_deregister_mr may need to do local invalidation and
sleep, if server remote invalidation is not used.

There are situations where the MID may not be created on I/O failure, in which
case the memory region is deregistered when the read data context is released.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Long Li
74dcf418fe CIFS: SMBD: Read correct returned data length for RDMA write (SMB read) I/O
This patch prepares the upper layer for doing SMB read via RDMA write.

When RDMA write is used for SMB read, the returned data length is in
DataRemaining in the response packet. Read it properly by adding a
parameter that specifies where the returned data length is.

Add the definition for memory registration to wdata and return the correct
length based on whether RDMA write is used.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Long Li
db223a590d CIFS: SMBD: Upper layer performs SMB write via RDMA read through memory registration
When sending I/O, if the size is larger than rdma_readwrite_threshold, we prepare
to send an SMB write packet for an RDMA read via memory registration. The actual
I/O is done by the remote peer through the local RDMA hardware. Modify the relevant
fields in the packet accordingly, and append a smbd_buffer_descriptor_v1 to
the end of the SMB write packet.

When the write I/O finishes, deregister the memory region if it was used for an
RDMA read. If remote invalidation is not used, the call to smbd_deregister_mr
does local invalidation and may wait. The memory region is normally deregistered
in the MID callback as soon as it has been used. If the MID was not created
because of an I/O failure, the memory region is deregistered when the write
data context is released.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:07 -06:00
Long Li
c739858334 CIFS: SMBD: Implement RDMA memory registration
Memory registration is used for transferring payload via RDMA read or write.
After I/O is done, memory registrations are recovered and reused. This
process can be time consuming and is done in a work queue.

Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:06 -06:00
Long Li
9762c2d080 CIFS: SMBD: Upper layer sends data via RDMA send
With SMB Direct connected, use it for sending data via RDMA send.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
d649e1bba3 CIFS: SMBD: Implement function to send data via RDMA send
The transport doesn't maintain send buffers or send queue for transferring
payload via RDMA send. There is no data copy in the transport on send.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
2fef137a2e CIFS: SMBD: Upper layer receives data via RDMA receive
With SMB Direct connected, use it for receiving data via RDMA receive.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
f64b78fd18 CIFS: SMBD: Implement function to receive data via RDMA receive
On the receive path, the transport maintains receive buffers and a reassembly
queue for transferring payload via RDMA recv. There is data copy in the
transport on recv when it copies the payload to upper layer.

The transport recognizes the RFC1002 header length used in the SMB
upper layer payloads in CIFS. Because this length is mainly used for TCP and
not applicable to RDMA, it is handled as out-of-band information and is
never sent over the wire, and the transport behaves like TCP to the upper layer
by processing and exposing the length correctly on data payloads.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
09902f8dc8 CIFS: SMBD: Set SMB Direct maximum read or write size for I/O
When connecting over SMB Direct, the transport negotiates its maximum I/O sizes
with the server and determines how to choose to do RDMA send/recv vs
read/write. Expose these maximum I/O sizes to upper layer so we will get the
correct sized payloads.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
bce9ce7cc0 CIFS: SMBD: Upper layer destroys SMB Direct session on shutdown or umount
When upper layer wants to umount, make it call shutdown on transport when
SMB Direct is used.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
8ef130f9ec CIFS: SMBD: Implement function to destroy a SMB Direct connection
Add function to tear down a SMB Direct connection. This is used by upper layer
to free all SMB Direct connection and transport resources.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
781a8050f2 CIFS: SMBD: Upper layer reconnects to SMB Direct session
Do a reconnect on SMB Direct when it is used as the connection. Reconnect can
happen for many reasons and it's mostly the decision of SMB2 upper layer.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:06 -06:00
Long Li
ad57b8e172 CIFS: SMBD: Implement function to reconnect to a SMB Direct transport
Add function to implement a reconnect to SMB Direct. This involves tearing down
the current connection and establishing/negotiating a new connection.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Long Li
2f8946464b CIFS: SMBD: Upper layer connects to SMBDirect session
When "rdma" is specified in the mount option, make CIFS connect to
SMB Direct.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-01-24 19:49:06 -06:00
Randy Dunlap
0933d6fa74 cifs: fix build errors for SMB_DIRECT
Prevent build errors when CIFS=y and INFINIBAND=m.

fs/cifs/smbdirect.o: In function `smbd_qp_async_error_upcall':
smbdirect.c:(.text+0x28c): undefined reference to `ib_event_msg'
fs/cifs/smbdirect.o: In function `smbd_destroy_rdma_work':
smbdirect.c:(.text+0xfde): undefined reference to `ib_drain_qp'
smbdirect.c:(.text+0xfea): undefined reference to `rdma_destroy_qp'
smbdirect.c:(.text+0x12a0): undefined reference to `ib_free_cq'
smbdirect.c:(.text+0x12ac): undefined reference to `ib_free_cq'
smbdirect.c:(.text+0x12b8): undefined reference to `ib_dealloc_pd'
smbdirect.c:(.text+0x12c4): undefined reference to `rdma_destroy_id'
fs/cifs/smbdirect.o: In function `_smbd_get_connection':
smbdirect.c:(.text+0x168c): undefined reference to `rdma_create_id'
smbdirect.c:(.text+0x1713): undefined reference to `rdma_resolve_addr'
smbdirect.c:(.text+0x1780): undefined reference to `rdma_resolve_route'
smbdirect.c:(.text+0x17e3): undefined reference to `rdma_destroy_id'
smbdirect.c:(.text+0x183d): undefined reference to `rdma_destroy_id'
smbdirect.c:(.text+0x199d): undefined reference to `ib_alloc_cq'
smbdirect.c:(.text+0x19d9): undefined reference to `ib_alloc_cq'
smbdirect.c:(.text+0x1a89): undefined reference to `rdma_create_qp'
smbdirect.c:(.text+0x1b3c): undefined reference to `rdma_connect'
smbdirect.c:(.text+0x2538): undefined reference to `rdma_destroy_qp'
smbdirect.c:(.text+0x2549): undefined reference to `ib_free_cq'
smbdirect.c:(.text+0x255a): undefined reference to `ib_free_cq'
smbdirect.c:(.text+0x2563): undefined reference to `ib_dealloc_pd'
smbdirect.c:(.text+0x256c): undefined reference to `rdma_destroy_id'
smbdirect.c:(.text+0x25f0): undefined reference to `__ib_alloc_pd'
smbdirect.c:(.text+0x26bb): undefined reference to `rdma_disconnect'
fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
smbdirect.c:(.text+0x62): undefined reference to `rdma_disconnect'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc:	Steve French <sfrench@samba.org>
Cc:	linux-cifs@vger.kernel.org
Cc:	samba-technical@lists.samba.org (moderated for non-subscribers)
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:06 -06:00
Matthew Wilcox
f04a703c3d cifs: Fix missing put_xid in cifs_file_strict_mmap
If cifs_zap_mapping() returned an error, we would return without putting
the xid that we got earlier.  Restructure cifs_file_strict_mmap() and
cifs_file_mmap() to be more similar to each other and have a single
point of return that always puts the xid.
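
The single-exit shape described above can be sketched in user space as
follows. All names are hypothetical stand-ins rather than the CIFS helpers,
and the two integer parameters simply simulate the results of the zap step and
the mmap step; the counter exists only so that a leaked id would be visible.

```c
#include <assert.h>

/* Hypothetical stand-ins: count outstanding ids so a leak is observable. */
int demo_outstanding_xids;

unsigned int demo_get_xid(void)
{
    return (unsigned int)++demo_outstanding_xids;
}

void demo_put_xid(unsigned int xid)
{
    assert(xid != 0 && demo_outstanding_xids > 0);
    --demo_outstanding_xids;
}

/* zap_rc and mmap_rc simulate the return codes of the two steps. */
int demo_strict_mmap(int zap_rc, int mmap_rc)
{
    unsigned int xid = demo_get_xid();
    int rc;

    rc = zap_rc;        /* stands in for the zap-mapping step */
    if (rc != 0)
        goto out;       /* an early "return rc" here would leak the xid */

    rc = mmap_rc;       /* stands in for the generic mmap step */
out:
    demo_put_xid(xid);  /* single point of return: always released */
    return rc;
}
```

Routing every path through one label keeps the release next to the return, so
a later error check cannot skip it by accident.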

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2018-01-24 19:49:06 -06:00
Long Li
d8ec913b17 CIFS: SMBD: export protocol initial values
For user-configurable SMB Direct protocol values, export them to /proc/fs/cifs.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:06 -06:00
Long Li
399f9539d9 CIFS: SMBD: Implement function to create a SMB Direct connection
The upper layer calls this function to connect to peer through SMB Direct.
Each SMB Direct connection is based on a RDMA RC Queue Pair.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:05 -06:00
Long Li
f198186aa9 CIFS: SMBD: Establish SMB Direct connection
Add code to implement the core functions to establish a SMB Direct connection.

1. Establish an RDMA connection to SMB server.
2. Negotiate and set up SMB Direct protocol.
3. Implement idle connection timer and credit management.

SMB Direct is enabled by setting CONFIG_CIFS_SMB_DIRECT.

Add to Makefile to enable building SMB Direct.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:05 -06:00
Long Li
03bee01d62 CIFS: SMBD: Add SMB Direct protocol initial values and constants
To prepare for protocol implementation, add constants and user-configurable
values for the SMB Direct protocol.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:05 -06:00
Long Li
8339dd32fb CIFS: SMBD: Add rdma mount option
Add "rdma" to CIFS mount options to connect to SMB Direct.
Add checks to validate this is used on SMB 3.X dialects.

To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1.

Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
2018-01-24 19:49:05 -06:00
Long Li
2b6ed88037 CIFS: SMBD: Introduce kernel config option CONFIG_CIFS_SMB_DIRECT
Build SMB Direct code when this option is set.

Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
2018-01-24 19:49:05 -06:00
Long Li
2dabfd5bab CIFS: SMBD: Add parameter rdata to smb2_new_read_req
This patch is for preparing upper layer for doing SMB read via RDMA write.

When we assemble the SMB read packet header, we need to know the I/O layout
if this request is to use a RDMA write. rdata has all the information we need
for memory registration. Add rdata to smb2_new_read_req.

Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
3cecf4865c cifs: avoid a kmalloc in smb2_send_recv/SendReceive2 for the common case
In both functions, use an array of 8 (arbitrary but should be big enough
for all current uses) iov and avoid having to kmalloc the array
for the common case.

If 8 is too small, then fall back to the original behaviour and use
kmalloc/kfree.

This should not change any behaviour but should save us a tiny amount of
cpu cycles.
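
The pattern can be sketched in user space as follows. This is a minimal
illustration, not the code from the patch: the vector type, the function and
the used_heap flag are hypothetical, and malloc/free stand in for
kmalloc/kfree. Only the size 8 and the fallback idea come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SMALL_VEC_COUNT 8  /* the "arbitrary but big enough" size */

/* Hypothetical I/O vector, shaped like a base pointer plus a length. */
struct demo_vec {
    void *base;
    size_t len;
};

/*
 * Copy n caller vectors into a scratch array and sum their lengths.
 * The scratch array lives on the stack unless n exceeds the small size.
 * *used_heap reports which path was taken; returns -1 if allocation fails.
 */
long demo_total_len(const struct demo_vec *vec, int n, int *used_heap)
{
    struct demo_vec small[DEMO_SMALL_VEC_COUNT];
    struct demo_vec *work = small;
    long total = 0;
    int i;

    assert(vec != NULL && n >= 0 && used_heap != NULL);
    *used_heap = 0;

    if (n > DEMO_SMALL_VEC_COUNT) {
        /* Uncommon case: fall back to a heap allocation. */
        work = malloc((size_t)n * sizeof(*work));
        if (work == NULL)
            return -1;
        *used_heap = 1;
    }

    memcpy(work, vec, (size_t)n * sizeof(*work));
    for (i = 0; i < n; i++)
        total += (long)work[i].len;

    if (work != small)
        free(work);     /* only the fallback path has anything to free */
    return total;
}
```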

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
305428acf0 cifs: remove small_smb2_init
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
8eb7998e79 cifs: remove rfc1002 header from smb2_lease_ack
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
5dfe69a407 cifs: remove unused variable from SMB2_read
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
21ad9487ca cifs: remove rfc1002 header from smb2_oplock_break we get from server
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
b2fb7fecc9 cifs: remove rfc1002 header from smb2_query_info_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
7c00c3a625 cifs: remove rfc1002 header from smb2_query_directory_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
2fc803efe6 cifs: remove rfc1002 header from smb2_set_info_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:05 -06:00
Ronnie Sahlberg
f5688a6d7c cifs: remove rfc1002 header from smb2 read/write requests
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
ced93679cb cifs: remove rfc1002 header from smb2_lock_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
1f444e4c06 cifs: remove rfc1002 header from smb2_flush_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
4f33bc3587 cifs: remove rfc1002 header from smb2_create_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
88ea5cb7d4 cifs: remove rfc1002 header from smb2_sess_setup_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
661bb943a9 cifs: remove rfc1002 header from smb2_tree_connect_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
7f7ae759fb cifs: remove rfc1002 header from smb2_echo_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
9775468020 cifs: remove rfc1002 header from smb2_ioctl_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
afcccefdc3 cifs: remove rfc1002 header from smb2_close_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
4eecf4cfe1 cifs: remove rfc1002 header from smb2_tree_disconnect_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
45305eda6b cifs: remove rfc1002 header from smb2_logoff_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
13cacea7bb cifs: remove rfc1002 header from smb2_negotiate_req
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:04 -06:00
Ronnie Sahlberg
83b7739180 cifs: Add smb2_send_recv
This function is similar to SendReceive2 except it does not expect
a 4 byte rfc1002 length header in the first io vector.
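
For context, the 4 byte header in question is the length prefix that frames
each SMB message on a TCP stream. The sketch below shows how such a prefix is
laid out, assuming the direct-TCP framing of one zero byte followed by a
24-bit big-endian length. The function names are hypothetical and are not
taken from the CIFS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RFC1002_HDR_LEN 4

/*
 * Write the 4 byte prefix for a message of msg_len bytes: a zero byte
 * followed by the length as a 24-bit big-endian number.
 */
void demo_put_rfc1002_len(uint8_t hdr[DEMO_RFC1002_HDR_LEN], uint32_t msg_len)
{
    assert(msg_len <= 0xFFFFFFu);   /* must fit in 24 bits */
    hdr[0] = 0;
    hdr[1] = (uint8_t)(msg_len >> 16);
    hdr[2] = (uint8_t)(msg_len >> 8);
    hdr[3] = (uint8_t)msg_len;
}

/* Size of a framed message: the prefix plus the message itself. */
size_t demo_framed_size(size_t msg_len)
{
    return DEMO_RFC1002_HDR_LEN + msg_len;
}
```

A send path that does not expect this prefix in the first vector can hand over
the bare message and leave the framing to the transport.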

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-01-24 19:49:04 -06:00
Trond Myklebust
535cb8f319 lockd: Fix server refcounting
The server shouldn't actually delete the struct nlm_host until it hits
the garbage collector. In order to make that work correctly with the
refcount API, we can bump the refcount by one, and then use
refcount_dec_if_one() in the garbage collector.
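
The scheme can be sketched with C11 atomics in user space. This is an
illustration under stated assumptions, not the lockd code: the struct and
function names are hypothetical, and a compare-and-swap from 1 to 0 stands in
for refcount_dec_if_one().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical host object; only the counter matters here. */
struct demo_host {
    atomic_int count;
};

/* Start at 1: the extra reference that only the collector may drop. */
void demo_host_init(struct demo_host *h)
{
    atomic_init(&h->count, 1);
}

/* A user takes a reference. */
void demo_host_get(struct demo_host *h)
{
    atomic_fetch_add(&h->count, 1);
}

/* A user releases its reference; this can never reach zero on its own. */
void demo_host_put(struct demo_host *h)
{
    int old = atomic_fetch_sub(&h->count, 1);

    assert(old > 1);
    (void)old;
}

/* Collector: succeed only if the extra reference is the last one (1 -> 0). */
bool demo_host_gc_try_release(struct demo_host *h)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&h->count, &expected, 0);
}
```

Because users can only bring the count back down to 1, the transition to 0,
and therefore the deletion, happens in the collector alone.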

Signed-off-by: Trond Myklebust <trondmy@gmail.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
2018-01-24 17:33:57 -05:00
Josef Bacik
e4fd493c05 Btrfs: fix stale entries in readdir
In fixing the readdir+pagefault deadlock I accidentally introduced a
stale entry regression in readdir.  If we get close to full for the
temporary buffer, and then skip a few delayed deletions, and then try to
add another entry that won't fit, we will emit the entries we found and
retry.  Unfortunately we delete entries from our del_list as we find
them, assuming we won't need them.  However our pos will be with
whatever our last entry was, which could be before the delayed deletions
we skipped, so the next search will add the deleted entries back into
our readdir buffer.  So instead don't delete entries we find in our
del_list so we can make sure we always find our delayed deletions.  This
is a slight perf hit for readdir with lots of pending deletions, but
hopefully this isn't a common occurrence.  If it is we can revisit this
and optimize it.

cc: stable@vger.kernel.org
Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-24 20:27:48 +01:00
Amir Goldstein
8383f17488 ovl: wire up NFS export operations
Now that NFS export operations are implemented, enable overlayfs NFS
export support if the "nfs_export" feature is enabled.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:06 +01:00
Amir Goldstein
0617015403 ovl: lookup indexed ancestor of lower dir
ovl_lookup_real() in lower layer walks back lower parents to find the
topmost indexed parent. If an indexed ancestor is found before reaching
lower layer root, ovl_lookup_real() is called recursively with upper
layer to walk back from indexed upper to the topmost connected/hashed
upper parent (or up to root).

ovl_lookup_real() in upper layer then walks forward to connect the topmost
upper overlay dir dentry and ovl_lookup_real() in lower layer continues to
walk forward to connect the decoded lower overlay dir dentry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:05 +01:00
Amir Goldstein
4b91c30a5a ovl: lookup connected ancestor of dir in inode cache
Decoding a dir file handle requires walking backward up to layer root and
for lower dir also checking the index to see if any of the parents have
been copied up.

Lookup overlay ancestor dentry in inode/dentry cache by decoded real
parents to shortcut looking up all the way back to layer root.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:05 +01:00
Amir Goldstein
7a9dadef96 ovl: hash non-indexed dir by upper inode for NFS export
Non-indexed upper dirs are encoded as upper file handles. When NFS export
is enabled, hash non-indexed directory inodes by upper inode, so we can
find them in inode cache using the decoded upper inode.

When NFS export is disabled, directories are not indexed on copy up, so
hash non-indexed directory inodes by origin inode, the same hash key
that is used before copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:04 +01:00
Amir Goldstein
988925164f ovl: decode pure lower dir file handles
Similar to decoding a pure upper dir file handle, decoding a pure lower
dir file handle is implemented by looking up an overlay dentry of the same
path as the pure lower path and verifying that the overlay dentry's
real lower matches the decoded real lower file handle.

Unlike the case of upper dir file handle, the lookup of overlay path by
lower real path can fail or find a mismatched overlay dentry if any of
the lower parents have been copied up and renamed. To address this case
we will need to check if any of the lower parents are indexed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:04 +01:00
Amir Goldstein
3b0bfc6ed3 ovl: decode indexed dir file handles
Decoding an indexed dir file handle is done by looking up the file handle
in index dir by name and then decoding the upper dir from the index origin
file handle. The decoded upper path is used to lookup an overlay dentry of
the same path.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:03 +01:00
Amir Goldstein
9436a1a339 ovl: decode lower file handles of unlinked but open files
Lookup overlay inode in cache by origin inode, so we can decode a file
handle of an open file even if the index has a whiteout index entry to
mark this overlay inode was unlinked.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:03 +01:00
Amir Goldstein
f71bd9cfb6 ovl: decode indexed non-dir file handles
Decoding an indexed non-dir file handle is similar to decoding a lower
non-dir file handle, but additionally, we lookup the file handle in index
dir by name to find the real upper inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:03 +01:00
Amir Goldstein
f941866fc4 ovl: decode lower non-dir file handles
Decoding a lower non-dir file handle is done by decoding the lower dentry
from underlying lower fs, finding or allocating an overlay inode that is
hashed by the real lower inode and instantiating an overlay dentry with
that inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:02 +01:00
Amir Goldstein
03e1c584ff ovl: encode lower file handles
For indexed or lower non-dir, encode a non-connectable lower file handle
from origin inode. For indexed or lower dir, when ofs->numlower == 1,
encode a lower file handle from lower dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:02 +01:00
Amir Goldstein
05e1f11816 ovl: copy up before encoding non-connectable dir file handle
Decoding a merge dir, whose origin's parent is under a redirected
lower dir is not always possible. As a simple approximation, we do
not encode lower dir file handles when overlay has multiple lower
layers and origin is below the topmost lower layer.

We should later relax this condition and copy up only the parent
that is under a redirected lower.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:01 +01:00
Amir Goldstein
b305e8443f ovl: encode non-indexed upper file handles
We only need to encode origin if there is a chance that the same object was
encoded pre copy up and then we need to stay consistent with the same
encoding also after copy up.

In case a non-pure upper is not indexed, then it was copied up before NFS
export support was enabled. In that case, we don't need to worry about
staying consistent with pre copy up encoding and we encode an upper file
handle.

This mitigates the problem that with no index, we cannot find an upper
inode from origin inode, so we cannot decode a non-indexed upper from
origin file handle.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:01 +01:00
Amir Goldstein
3985b70a3e ovl: decode connected upper dir file handles
Until this change, we decoded upper file handles by instantiating an
overlay dentry from the real upper dentry. This is sufficient to handle
pure upper files, but insufficient to handle merge/impure dirs.

To that end, if decoded real upper dir is connected and hashed, we
lookup an overlay dentry with the same path as the real upper dir.
If decoded real upper is non-dir, we instantiate a disconnected overlay
dentry as before this change.

Because ovl_fh_to_dentry() returns a connected overlay dir dentry,
exportfs never needs to call get_parent() and get_name() to reconnect an
upper overlay dir. Because connectable non-dir file handles are not
supported, exportfs will not be able to use fh_to_parent() and get_name()
methods to reconnect a disconnected non-dir to its parent. Therefore, the
methods get_parent() and get_name() are implemented just to print out a
sanity warning and the method fh_to_parent() is implemented to warn the
user that using the 'subtree_check' exportfs option is not supported.

An alternative approach could have been to implement instantiating of
an overlay directory inode from origin/index and implement get_parent()
and get_name() by calling into underlying fs operations and then
instantiating the overlay parent dir.

The reasons for not choosing the get_parent() approach were:
- Obtaining a disconnected overlay dir dentry would require a
  delicate re-factoring of ovl_lookup() to get a dentry with overlay
  parent info. It was preferred to avoid doing that re-factoring unless
  it was proven worthy.
- Going down the path of disconnected dir would mean that the (non
  trivial) code path of d_splice_alias() could be traveled and that
  meant writing more tests and introducing race cases that are very hard
  to hit on purpose. Taking the path of connecting overlay dentry by
  forward lookup is therefore the safe and boring way to avoid surprises.

The drawbacks of the chosen "connected overlay dentry" approach:
- We need to take special care of renames of ancestors while connecting
  the overlay dentry by real dentry path. These subtleties are usually
  handled by generic exportfs and VFS code.
- In a hypothetical workload, we could end up in a loop trying to connect,
  interrupted by rename and restarting connect forever.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:00 +01:00
Amir Goldstein
8556a4205b ovl: decode pure upper file handles
Decoding an upper file handle is done by decoding the upper dentry from
underlying upper fs, finding or allocating an overlay inode that is
hashed by the real upper inode and instantiating an overlay dentry with
that inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:26:00 +01:00
Amir Goldstein
8ed5eec9d6 ovl: encode pure upper file handles
Encode overlay file handles as struct ovl_fh containing the file handle
encoding of the real upper inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:59 +01:00
Miklos Szeredi
f9c34674bc vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()
Those helpers are going to be used by overlayfs to implement
NFS export decode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:59 +01:00
Amir Goldstein
c62520a83b ovl: store 'has_upper' and 'opaque' as bit flags
We need to make some room in struct ovl_entry to store information
about redirected ancestors for NFS export, so cram two booleans as
bit flags.
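
As an illustration only, packing two booleans into one flags word might look
like the sketch below. The flag names and struct entry are hypothetical, and
plain bit operations are used here where kernel code would more likely use
atomic bit helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit numbers; the names are illustrative, not taken from the patch. */
enum entry_flag {
	ENTRY_HAS_UPPER = 0,
	ENTRY_OPAQUE = 1,
};

/* One word holds both former booleans and leaves room for more bits. */
struct entry {
	unsigned long flags;
};

static void entry_set_flag(struct entry *e, enum entry_flag f, bool on)
{
	assert(e != NULL);
	if (on)
		e->flags |= 1UL << f;
	else
		e->flags &= ~(1UL << f);
}

static bool entry_test_flag(const struct entry *e, enum entry_flag f)
{
	assert(e != NULL);
	return (e->flags >> f) & 1UL;
}
```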

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:58 +01:00
Amir Goldstein
aa3ff3c152 ovl: copy up of disconnected dentries
With NFS export, some operations on decoded file handles (e.g. open,
link, setattr, xattr_set) may call copy up with a disconnected non-dir.
In this case, we will copy up lower inode to index dir without
linking it to upper dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:58 +01:00
Amir Goldstein
829c28be9b ovl: use d_splice_alias() in place of d_add() in lookup
This is required for NFS export.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:57 +01:00
Amir Goldstein
0aceb53e73 ovl: do not pass overlay dentry to ovl_get_inode()
This is needed for using ovl_get_inode() for decoding file handles
for NFS export.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:57 +01:00
Amir Goldstein
91ffe7beb3 ovl: factor out ovl_get_index_fh() helper
The helper is needed to lookup an index by file handle for NFS export.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:56 +01:00
Amir Goldstein
24f0b17203 ovl: whiteout orphan index entries on mount
Orphan index entries are non-dir index entries whose union nlink count
dropped to zero. With index=on, orphan index entries are removed on
mount. With NFS export feature enabled, orphan index entries are replaced
with white out index entries to block future open by handle from opening
the lower file.

When dir index has a stale 'upper' xattr, we assume that the upper dir
was removed and we treat the dir index as orphan entry that needs to be
whited out or removed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:56 +01:00
Amir Goldstein
e7dd0e7134 ovl: whiteout index when union nlink drops to zero
With NFS export feature enabled, when overlay inode nlink drops to
zero, instead of removing the index entry, replace it with a whiteout
index entry.

This is needed for NFS export in order to prevent future open by handle
from opening the lower file directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:56 +01:00
Amir Goldstein
89a17556ce ovl: cleanup dir index when dir nlink drops to zero
When non-dir index union nlink drops to zero the non-dir index
is cleaned. Do the same for directory type index entries when
union directory is removed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:55 +01:00
Amir Goldstein
016b720f55 ovl: index directories on copy up for NFS export
With the NFS export feature enabled, all dirs are indexed on copy up.
Non-dir files are copied up directly to indexdir and then hardlinked
to upper dir.

Directories are copied up to indexdir, then an index entry is created
in indexdir with 'upper' xattr pointing to the copied up dir and then
the copied up dir is moved to upper dir.

Directory index is also used for consistency verification, like
detecting multiple redirected dirs to the same lower dir on lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:55 +01:00
Amir Goldstein
fbd2d2074b ovl: index all non-dir on copy up for NFS export
With the NFS export feature enabled, all non-dir are indexed on copy up.
The copy up origin inode of an indexed non-dir can be used as a unique
identifier of the overlay object.

The full index is also used for consistency verification, like detecting
multiple non-hardlink uppers with the same 'origin' on lookup.

Directory index on copy up will be implemented by a following patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:54 +01:00
Amir Goldstein
24b33ee104 ovl: create ovl_need_index() helper
The helper determines which lower file needs to be indexed
on copy up and before nlink changes.

For index=on, the helper evaluates to true for lower hardlinks.
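
A simplified sketch of the rule stated above, assuming a hypothetical
struct lower_info in place of the real dentry and inode lookups:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the lower object behind an overlay dentry. */
struct lower_info {
	bool exists;        /* there is a lower object at all */
	bool is_dir;        /* the lower object is a directory */
	unsigned int nlink; /* link count of the lower inode */
};

/* With index=on, only lower hardlinks (non-dir, nlink > 1) need an index. */
static bool need_index(const struct lower_info *lower, bool index_on)
{
	assert(lower != NULL);
	if (!index_on || !lower->exists)
		return false;
	return !lower->is_dir && lower->nlink > 1;
}
```

Having a single predicate means copy up and nlink updates agree on which
files are indexed, and later patches can widen the rule in one place.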

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:54 +01:00
Amir Goldstein
9ee60ce249 ovl: cleanup temp index entries
A previous failed attempt to create or whiteout a directory index may
leave index entries named '#%x' in the index dir. Cleanup those temp
entries on mount instead of failing the mount.
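
A small sketch of how such names could be produced and recognized. Both
helper names are made up for illustration; the only detail taken from the
text above is the '#%x' naming pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Format a temp name as described above: '#' followed by a hex number. */
static int temp_name(char *buf, size_t size, unsigned int id)
{
	assert(buf != NULL);
	return snprintf(buf, size, "#%x", id);
}

/* A leading '#' marks a leftover temp entry that mount can clean up. */
static bool is_temp_entry(const char *name)
{
	assert(name != NULL);
	return name[0] == '#';
}
```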

In the future, we may drop 'work' dir and use 'index' dir instead.
This change is enough for cleaning up copy up leftovers 'from the future',
but it is not enough for cleaning up rmdir leftovers 'from the future'
(i.e. temp dir containing whiteouts).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:53 +01:00
Amir Goldstein
e8f9e5b780 ovl: verify directory index entries on mount
Directory index entries should have 'upper' xattr pointing to the real
upper dir. Verifying that the upper dir file handle is not stale is
expensive, so only verify stale directory index entries on mount if
NFS export feature is enabled.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:53 +01:00
Amir Goldstein
7db25d36d9 ovl: verify whiteout index entries on mount
Whiteout index entries are used as an indication that an exported
overlay file handle should be treated as stale (i.e. after unlink
of the overlay inode).

Check on mount that whiteout index entries have a name that looks like
a valid file handle and cleanup invalid index entries.
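
One plausible form of the name check, sketched under the assumption that an
index entry name is the hex encoding of the file handle bytes. The helper
name and the max_bytes limit are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if name is a non-empty, even-length string of hex digits
 * that decodes to at most max_bytes bytes.
 */
static bool name_looks_like_fh(const char *name, size_t max_bytes)
{
	size_t len;

	assert(name != NULL);
	len = strlen(name);
	if (len == 0 || (len & 1) || len / 2 > max_bytes)
		return false;

	for (size_t i = 0; i < len; i++)
		if (!isxdigit((unsigned char)name[i]))
			return false;

	return true;
}
```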

For whiteout index entries, do not check that they also have valid
origin fh and nlink xattr, because those xattr do not exist for a
whiteout index entry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:53 +01:00
Amir Goldstein
ad1d615cec ovl: use directory index entries for consistency verification
A directory index is a directory type entry in index dir with a
"trusted.overlay.upper" xattr containing an encoded ovl_fh of the merge
directory upper dir inode.

On lookup of non-dir files, lower file is followed by origin file handle.
On lookup of dir entries, lower dir is found by name and then compared
to origin file handle. We only trust dir index if we verified that lower
dir matches origin file handle, otherwise index may be inconsistent and
we ignore it.

If we find an indexed non-upper dir or an indexed merged dir, whose
index 'upper' xattr points to a different upper dir, that means that the
lower directory may be also referenced by another upper dir via redirect,
so we fail the lookup on inconsistency error.

To be consistent with directory index entries format, the association of
index dir to upper root dir, that was stored by older kernels in
"trusted.overlay.origin" xattr is now stored in "trusted.overlay.upper"
xattr. This also serves as an indication that overlay was mounted with a
kernel that supports index directory entries. For backward compatibility,
if an 'origin' xattr exists on the index dir we also verify it on mount.

Directory index entries are going to be used for NFS export.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:52 +01:00
Amir Goldstein
86eaa13046 ovl: unbless lower st_ino of unverified origin
On a malformed overlay, several redirected dirs can point to the same
dir on a lower layer. This presents a similar challenge as broken
hardlinks, because different objects in the overlay can return the same
st_ino/st_dev pair from stat(2).

For broken hardlinks, we do not provide constant st_ino on copy up to
avoid this inconsistency. When NFS export feature is enabled, apply
the same logic to files and directories with unverified lower origin.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:52 +01:00
Amir Goldstein
37b12916c0 ovl: verify stored origin fh matches lower dir
When the NFS export feature is enabled, overlayfs implicitly enables the
feature "verify_lower". When the "verify_lower" feature is enabled, a
directory inode found in lower layer by name or by redirect_dir is
verified against the file handle of the copy up origin that is stored in
the upper layer.

This introduces a change of behavior for the case of lower layer
modification while overlay is offline. A lower directory created or
moved offline under an existing upper directory will not be merged with
that upper directory.

The NFS export feature should not be used after copying layers, because
the new lower directory inodes would fail verification and won't be
merged with upper directories.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:51 +01:00
Amir Goldstein
f168f1098d ovl: add support for "nfs_export" configuration
Introduce the "nfs_export" config, module and mount options.

The NFS export feature depends on the "index" feature and enables two
implicit overlayfs features: "index_all" and "verify_lower".
The "index_all" feature creates an index on copy up of every file and
directory. The "verify_lower" feature uses the full index to detect
overlay filesystem inconsistencies on lookup, like redirect from
multiple upper dirs to the same lower dir.

NFS export can be enabled for non-upper mount with no index. However,
because lower layer redirects cannot be verified with the index, enabling
NFS export support on an overlay with no upper layer requires turning off
redirect follow (e.g. "redirect_dir=nofollow").
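
The dependency rules in the two paragraphs above can be summarized in a short
sketch. struct ovl_opts and nfs_export_allowed() are hypothetical names used
only to restate the rules; they are not the option parsing code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the relevant mount configuration. */
struct ovl_opts {
	bool has_upper;       /* an upper layer is configured */
	bool index;           /* "index" feature is on */
	bool nfs_export;      /* "nfs_export" was requested */
	bool redirect_follow; /* redirects are followed on lookup */
};

/* Returns true if NFS export can stay enabled under the stated rules. */
static bool nfs_export_allowed(const struct ovl_opts *o)
{
	assert(o != NULL);
	if (!o->nfs_export)
		return false;
	/* With an upper layer, NFS export depends on the index. */
	if (o->has_upper)
		return o->index;
	/* Without an upper layer, redirect follow must be turned off. */
	return !o->redirect_follow;
}
```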

The full index may incur some overhead on mount time, especially when
verifying that lower directory file handles are not stale.

NFS export support, full index and consistency verification will be
implemented by following patches.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:37 +01:00
Amir Goldstein
60b866420b ovl: update documentation of inodes index feature
Document that inode index feature solves breaking hard links on
copy up.

Simplify Kconfig backward compatibility disclaimer.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:20:02 +01:00
Amir Goldstein
051224438a ovl: generalize ovl_verify_origin() and helpers
Remove the "origin" language from the functions that handle set, get
and verify of "origin" xattr and pass the xattr name as an argument.

The same helpers are going to be used for NFS export to set, get and
verify the "upper" xattr for directory index entries.

ovl_verify_origin() is now a helper used only to verify non upper
file handle stored in "origin" xattr of upper inode.

The upper root dir file handle is still stored in "origin" xattr on
the index dir for backward compatibility. This is going to be changed
by the patch that adds directory index entries support.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:54 +01:00
Amir Goldstein
1eff1a1dee ovl: simplify arguments to ovl_check_origin_fh()
Pass the fs instance with lower_layers array instead of the dentry
lowerstack array to ovl_check_origin_fh(), because the dentry members
of lowerstack play no role in this helper.

This change simplifies the argument list of ovl_check_origin(),
ovl_cleanup_index() and ovl_verify_index().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:46 +01:00
Amir Goldstein
2e1a532883 ovl: factor out ovl_check_origin_fh()
Re-factor ovl_check_origin() and ovl_get_origin(), so origin fh xattr is
read from upper inode only once during lookup with multiple lower layers
and only once when verifying index entry origin.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:35 +01:00
Amir Goldstein
d583ed7d13 ovl: store layer index in ovl_layer
Store the fs root layer index inside ovl_layer struct, so we can
get the root fs layer index from merge dir lower layer instead of
finding it with ovl_find_layer() helper.
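
The difference can be sketched with a simplified layer record. struct layer,
find_layer() and layer_index() are illustrative stand-ins, not the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified layer record: "root" stands in for the layer's
 * root object and "idx" caches the layer's position in the array.
 */
struct layer {
	const void *root;
	int idx;
};

/* Before: recover the index by scanning all layers for a matching root. */
static int find_layer(const struct layer *layers, int n, const void *root)
{
	assert(layers != NULL);
	for (int i = 0; i < n; i++)
		if (layers[i].root == root)
			return i;
	return -1;
}

/* After: the index travels with the layer itself. */
static int layer_index(const struct layer *l)
{
	assert(l != NULL);
	return l->idx;
}
```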

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:25 +01:00
Amir Goldstein
972d0093c2 ovl: force r/o mount when index dir creation fails
When work dir creation fails, a warning is emitted and overlay is
mounted r/o. Trying to remount r/w will fail with no work dir.

When index dir creation fails, the same warning is emitted and overlay
is mounted r/o, but trying to remount r/w will succeed. This may cause
unintentional corruption of filesystem consistency.

Adjust the behavior of index dir creation failure to that of work dir
creation failure and do not allow remounting r/w. The user needs to state
an explicit intention to work without an index by mounting with
option 'index=off' to allow r/w mount with no index dir.

When mounting with option 'index=on' and no 'upperdir', index is
implicitly disabled, so do not warn about no file handle support.

The issue was introduced with inodes index feature in v4.13, but this
patch will not apply cleanly before ovl_fill_super() re-factoring in
v4.15.

Fixes: 02bcd15774 ("ovl: introduce the inodes index dir feature")
Cc: <stable@vger.kernel.org> #v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:14 +01:00
Amir Goldstein
a683737ba9 ovl: disable index when no xattr support
Overlayfs falls back to index=off if lower/upper fs does not support
file handles. Do the same if upper fs does not support xattr.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:19:07 +01:00
Amir Goldstein
9678e63030 ovl: fix inconsistent d_ino for legacy merge dir
For a merge dir that was copied up before v4.12 or that was hand crafted
offline (e.g. mkdir {upper/lower}/dir), upper dir does not contain the
'trusted.overlay.origin' xattr.  In that case, stat(2) on the merge dir
returns the lower dir st_ino, but getdents(2) returns the upper dir d_ino.

After this change, on merge dir lookup, missing origin xattr on upper
dir will be fixed and 'impure' xattr will be fixed on parent of the legacy
merge dir.

Suggested-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 10:18:19 +01:00
David S. Miller
5ca114400d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.

The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 13:51:56 -05:00
Bob Peterson
805c090750 GFS2: Log the reason for log flushes in every log header
This patch just adds the capability for GFS2 to track which function
called gfs2_log_flush. This should make it easier to diagnose
problems based on the sequence of events found in the journals.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23 07:39:20 -07:00
Bob Peterson
c1696fb85d GFS2: Introduce new gfs2_log_header_v2
This patch adds a new structure called gfs2_log_header_v2 which is used
to store expanded fields into previously unused areas of the log headers
(i.e., this change is backwards compatible).  Some of these are used for
debug purposes so we can backtrack when problems occur.  Others are
reserved for future expansion.

This patch is based on a prototype from Steve Whitehouse.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23 07:38:53 -07:00
Greg Kroah-Hartman
78fae52cf4 sysfs: remove DEBUG defines
It isn't needed at all in these files, dynamic debug is the best way to
enable this type of thing, if you really want it.  As it is, these
defines were not doing anything at all.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 10:19:23 +01:00
Greg Kroah-Hartman
619daeeeb8 sysfs: use SPDX identifiers
Move the license "mark" of the sysfs files to be in SPDX form, instead
of the custom text that it currently is in.  This is in a quest to get
rid of the 700+ different ways we say "GPLv2" in the kernel tree.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 10:19:10 +01:00
Ben Hutchings
1995266727 nfsd: auth: Fix gid sorting when rootsquash enabled
Commit bdcf0a423e ("kernel: make groups_sort calling a responsibility
group_info allocators") appears to break nfsd rootsquash in a pretty
major way.

It adds a call to groups_sort() inside the loop that copies/squashes
gids, which means the valid gids are sorted along with the following
garbage.  The net result is that the highest numbered valid gids are
replaced with any lower-valued garbage gids, possibly including 0.

We should sort only once, after filling in all the gids.
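
A userspace sketch of the corrected ordering, using qsort() in place of
groups_sort(). The function name and the squash rule (gid 0 is mapped to
anon_gid) are simplified assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_gid(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/* Copy n gids from src to dst, squashing gid 0, then sort exactly once. */
static void squash_and_sort(const unsigned int *src, unsigned int *dst,
			    size_t n, unsigned int anon_gid)
{
	assert(src != NULL && dst != NULL);

	for (size_t i = 0; i < n; i++)
		dst[i] = (src[i] == 0) ? anon_gid : src[i];

	/* Sorting after the loop only ever sees initialised entries. */
	qsort(dst, n, sizeof(*dst), cmp_gid);
}
```

Sorting inside the loop would instead reorder the filled entries together
with the not yet written tail of dst, which is the failure described above.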

Fixes: bdcf0a423e ("kernel: make groups_sort calling a responsibility ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-22 20:13:07 -08:00
Jaegeuk Kim
f236792311 f2fs: allow to recover node blocks given updated checkpoint
If fsck.f2fs changes crc, we have no way to recover some inode blocks by
roll-forward recovery. Let's relax the condition to recover them.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:59 -08:00
Jaegeuk Kim
37a086f015 f2fs: recover some i_inline flags
This fixes lost i_inline flags during roll-forward.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:58 -08:00
Daeho Jeong
b2c4692bc2 f2fs: correct removexattr behavior for null valued extended attribute
__vfs_removexattr() transfers "NULL" value to the setxattr handler of
the f2fs filesystem in order to remove the extended attribute. But,
__f2fs_setxattr() just ignores the removal request when the value of
the extended attribute is already NULL. We have to remove the extended
attribute itself even if the value of that is already NULL.
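
The check can be illustrated with a toy single-slot attribute store. This is
a sketch under assumptions, not the f2fs code: struct xattr_slot and
slot_set() are hypothetical, and a NULL value is taken to mean "remove" as
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-slot attribute store, just enough to show the check. */
struct xattr_slot {
	bool present;
	size_t size;
	char data[64];
};

static bool value_same(const struct xattr_slot *s, const void *value,
		       size_t size)
{
	return s->size == size &&
	       (size == 0 || memcmp(s->data, value, size) == 0);
}

/*
 * value == NULL requests removal; otherwise store the (possibly empty)
 * value. Returns 0 on success, -1 if the value does not fit.
 */
static int slot_set(struct xattr_slot *s, const void *value, size_t size)
{
	assert(s != NULL);
	if (size > sizeof(s->data))
		return -1;

	/* Take the "nothing changed" shortcut only for a real set request. */
	if (s->present && value && value_same(s, value, size))
		return 0;

	if (!value) {
		s->present = false;
		s->size = 0;
		return 0;
	}

	memcpy(s->data, value, size);
	s->size = size;
	s->present = true;
	return 0;
}
```

Without the "value &&" test, removing an attribute whose stored value is
already empty would match the shortcut and leave the attribute in place.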

We can reproduce this bug with the below:

1. touch file
2. setfattr -n "user.foo" file
3. setfattr -x "user.foo" file
4. getfattr -d file
> user.foo

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:57 -08:00
Chao Yu
db198ae0f8 f2fs: drop page cache after fs shutdown
Don't leave dirtied page cache in f2fs after shutdown. Dropping it can
mitigate memory pressure on the whole system, so that other modules keep
working properly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:56 -08:00
Chao Yu
7950e9ac63 f2fs: stop gc/discard thread after fs shutdown
Once the filesystem shuts down, daemons like the gc/discard threads should
be aware of it and exit. In addition, drop all cached pending discard
commands and turn off real-time discard mode.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:55 -08:00
Chao Yu
d027c48447 f2fs: handle error case in f2fs_ioc_shutdown
This patch makes f2fs_ioc_shutdown handle the error case correctly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:54 -08:00
Chao Yu
bb9e3bb8db f2fs: split need_inplace_update
This patch splits need_inplace_update into two functions:
a. should_update_inplace() includes all conditions that we must use IPU.
b. should_update_outplace() includes all conditions that we must use OPU.

So, in f2fs_ioc_set_pin_file() and f2fs_defragment_range(), we can use
the corresponding function to check whether we can trigger OPU/IPU or not.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:53 -08:00
Chao Yu
eb4497975e f2fs: fix to update last_disk_size correctly
This patch updates last_disk_size only when the page is written out
successfully.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:52 -08:00
Chao Yu
b323fd28bb f2fs: kill F2FS_INLINE_XATTR_ADDRS for cleanup
Use get_inline_xattr_addrs directly instead of F2FS_INLINE_XATTR_ADDRS.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:51 -08:00
Chao Yu
d7997e6368 f2fs: clean up error path of fill_super
This patch cleans up the error path of fill_super to avoid unneeded
release steps.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:50 -08:00
Sheng Yong
a9d572c755 f2fs: avoid hungtask when GC encrypted block if io_bits is set
When io_bits is set, GCing an encrypted block may hit the following hungtask.
Since io_bits requires an aligned block address, f2fs_submit_page_write may
return -EAGAIN if new_blkaddr does not satisfy the io_bits alignment. As a
result, the encrypted page will never be written back.

This patch makes move_data_block aware of the EAGAIN error and cancels the
writeback.

[  246.751371] INFO: task kworker/u4:4:797 blocked for more than 90 seconds.
[  246.752423]       Not tainted 4.15.0-rc4+ #11
[  246.754176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.755336] kworker/u4:4    D25448   797      2 0x80000000
[  246.755597] Workqueue: writeback wb_workfn (flush-7:0)
[  246.755616] Call Trace:
[  246.755695]  ? __schedule+0x322/0xa90
[  246.755761]  ? blk_init_request_from_bio+0x120/0x120
[  246.755773]  ? pci_mmcfg_check_reserved+0xb0/0xb0
[  246.755801]  ? __radix_tree_create+0x19e/0x200
[  246.755813]  ? delete_node+0x136/0x370
[  246.755838]  schedule+0x43/0xc0
[  246.755904]  io_schedule+0x17/0x40
[  246.755939]  wait_on_page_bit_common+0x17b/0x240
[  246.755950]  ? wake_page_function+0xa0/0xa0
[  246.755961]  ? add_to_page_cache_lru+0x160/0x160
[  246.755972]  ? page_cache_tree_insert+0x170/0x170
[  246.755983]  ? __lru_cache_add+0x96/0xb0
[  246.756086]  __filemap_fdatawait_range+0x14f/0x1c0
[  246.756097]  ? wait_on_page_bit_common+0x240/0x240
[  246.756120]  ? __wake_up_locked_key_bookmark+0x20/0x20
[  246.756167]  ? wait_on_all_pages_writeback+0xc9/0x100
[  246.756179]  ? __remove_ino_entry+0x120/0x120
[  246.756192]  ? wait_woken+0x100/0x100
[  246.756204]  filemap_fdatawait_range+0x9/0x20
[  246.756216]  write_checkpoint+0x18a1/0x1f00
[  246.756254]  ? blk_get_request+0x10/0x10
[  246.756265]  ? cpumask_next_and+0x43/0x60
[  246.756279]  ? f2fs_sync_inode_meta+0x160/0x160
[  246.756289]  ? remove_element.isra.4+0xa0/0xa0
[  246.756300]  ? __put_compound_page+0x40/0x40
[  246.756310]  ? f2fs_sync_fs+0xec/0x1c0
[  246.756320]  ? f2fs_sync_fs+0x120/0x1c0
[  246.756329]  f2fs_sync_fs+0x120/0x1c0
[  246.756357]  ? trace_event_raw_event_f2fs__page+0x260/0x260
[  246.756393]  ? ata_build_rw_tf+0x173/0x410
[  246.756397]  f2fs_balance_fs_bg+0x198/0x390
[  246.756405]  ? drop_inmem_page+0x230/0x230
[  246.756415]  ? ahci_qc_prep+0x1bb/0x2e0
[  246.756418]  ? ahci_qc_issue+0x1df/0x290
[  246.756422]  ? __accumulate_pelt_segments+0x42/0xd0
[  246.756426]  ? f2fs_write_node_pages+0xd1/0x380
[  246.756429]  f2fs_write_node_pages+0xd1/0x380
[  246.756437]  ? sync_node_pages+0x8f0/0x8f0
[  246.756440]  ? update_curr+0x53/0x220
[  246.756444]  ? __accumulate_pelt_segments+0xa2/0xd0
[  246.756448]  ? __update_load_avg_se.isra.39+0x349/0x360
[  246.756452]  ? do_writepages+0x2a/0xa0
[  246.756456]  do_writepages+0x2a/0xa0
[  246.756460]  __writeback_single_inode+0x70/0x490
[  246.756463]  ? check_preempt_wakeup+0x199/0x310
[  246.756467]  writeback_sb_inodes+0x2a2/0x660
[  246.756471]  ? is_empty_dir_inode+0x40/0x40
[  246.756474]  ? __writeback_single_inode+0x490/0x490
[  246.756477]  ? string+0xbf/0xf0
[  246.756480]  ? down_read_trylock+0x35/0x60
[  246.756484]  __writeback_inodes_wb+0x9f/0xf0
[  246.756488]  wb_writeback+0x41d/0x4b0
[  246.756492]  ? writeback_inodes_wb.constprop.55+0x150/0x150
[  246.756498]  ? set_worker_desc+0xf7/0x130
[  246.756502]  ? current_is_workqueue_rescuer+0x60/0x60
[  246.756511]  ? _find_next_bit+0x2c/0xa0
[  246.756514]  ? wb_workfn+0x400/0x5d0
[  246.756518]  wb_workfn+0x400/0x5d0
[  246.756521]  ? finish_task_switch+0xdf/0x2a0
[  246.756525]  ? inode_wait_for_writeback+0x30/0x30
[  246.756529]  process_one_work+0x3a7/0x6f0
[  246.756533]  worker_thread+0x82/0x750
[  246.756537]  kthread+0x16f/0x1c0
[  246.756541]  ? trace_event_raw_event_workqueue_work+0x110/0x110
[  246.756544]  ? kthread_create_worker_on_cpu+0xb0/0xb0
[  246.756548]  ret_from_fork+0x1f/0x30

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:49 -08:00
Jaegeuk Kim
d8a9a22992 f2fs: allow quota to use reserved blocks
This patch allows quota to use reserved blocks all the time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:48 -08:00
Chao Yu
a2e2e76b23 f2fs: fix to drop all inmem pages correctly
In commit 57864ae5ce ("f2fs: limit # of inmemory pages"), we limited the
memory footprint of all inmem pages to 20% of total memory. If we
exceed that threshold, we try to drop all inmem pages to avoid excessive
memory pressure resulting in performance regression.

But in some unrelated error paths we also drop all inmem pages, which
is wrong. Fix it in this patch.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:47 -08:00
Chao Yu
f3d98e74fc f2fs: speed up defragment on sparse file
f2fs_map_blocks can now return the next page offset that has a valid
mapping when it crosses a hole. Use that to speed up defragment on a
sparse file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:46 -08:00
Chao Yu
c4020b2da4 f2fs: support F2FS_IOC_PRECACHE_EXTENTS
This patch introduces a new ioctl, F2FS_IOC_PRECACHE_EXTENTS, to precache
extent info as ext4 does, in order to gain better performance when
triggering AIO by eliminating synchronous waiting for mapping info.

Referred commit: 7869a4a6c5 ("ext4: add support for extent pre-caching")

In addition, with the newly added extent precache ability, this patch adds
support for FIEMAP_FLAG_CACHE in ->fiemap.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:45 -08:00
Jaegeuk Kim
1ad71a2712 f2fs: add an ioctl to disable GC for specific file
This patch adds a flag to disable GC on a given file, which is useful when
the user wants to keep its block map. It also performs in-place updates for
a dontmove file.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-22 14:56:35 -08:00
Martin Brandenburg
a0ec1ded22 orangefs: initialize op on loop restart in orangefs_devreq_read
In orangefs_devreq_read, there is a loop which picks an op off the list
of pending ops.  If the loop fails to find an op, there is nothing to
read, and it returns EAGAIN.  If the op has been given up on, the loop
is restarted via a goto.  The bug is that the variable which the found
op is written to is not reinitialized, so if there are no more eligible
ops on the list, the code runs again on the already handled op.

This is triggered by interrupting a process while the op is being copied
to the client-core.  It's a fairly small window, but it's there.
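
The restart problem described above can be sketched in userspace C as
follows. This is only a simplified model: struct op_sketch, its fields and
pick_op() are hypothetical names, and an array stands in for the pending
list. It is not the orangefs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending op. */
struct op_sketch {
	bool eligible;	/* still waiting to be read */
	bool given_up;	/* the waiter was interrupted and gave up */
};

/* Returns the op to hand to the reader, or NULL (the EAGAIN case). */
static struct op_sketch *pick_op(struct op_sketch *ops, size_t n)
{
	struct op_sketch *cur;
	size_t i;

	assert(ops != NULL || n == 0);
restart:
	cur = NULL;	/* the fix: forget the op found on the previous pass */
	for (i = 0; i < n; i++) {
		if (ops[i].eligible) {
			cur = &ops[i];
			break;
		}
	}
	if (!cur)
		return NULL;	/* nothing left to read */
	if (cur->given_up) {
		cur->eligible = false;	/* handled, never pick it again */
		goto restart;
	}
	cur->eligible = false;
	return cur;
}
```

If cur were initialized only once, before the restart label, a second pass
that finds no eligible op would leave cur pointing at the op that was
already handled.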

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-22 13:51:14 -08:00
Martin Brandenburg
0afc0decf2 orangefs: use list_for_each_entry_safe in purge_waiting_ops
set_op_state_purged can delete the op.
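
A minimal userspace sketch of why the "safe" iteration form matters when
the loop body may free the current entry. The list type and the helpers
push_op() and purge_ops() are hypothetical; the kernel's
list_for_each_entry_safe() applies the same idea by caching the next entry
before the body runs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list node standing in for an op. */
struct op_node_sketch {
	struct op_node_sketch *next;
};

/* Push a new node on the front of the list; returns 0 on success. */
static int push_op(struct op_node_sketch **head)
{
	struct op_node_sketch *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->next = *head;
	*head = n;
	return 0;
}

/* Free every node, returning how many were freed. */
static size_t purge_ops(struct op_node_sketch **head)
{
	struct op_node_sketch *pos, *tmp;
	size_t freed = 0;

	assert(head != NULL);
	for (pos = *head; pos; pos = tmp) {
		tmp = pos->next;	/* read the successor before pos is freed */
		free(pos);
		freed++;
	}
	*head = NULL;
	return freed;
}
```

Advancing with pos = pos->next after the free() would read freed memory,
which is the class of bug the commit subject points at.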

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-22 13:51:14 -08:00
Anand Jain
f2788d2f76 btrfs: set the total_devices in device_list_add()
device_list_add() has no caller other than btrfs_scan_one_device(),
which sets btrfs_fs_devices::total_devices if device_list_add() is
successful. This can be done within device_list_add() itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:56 +01:00
Anand Jain
327f18cc7f btrfs: move pr_info into device_list_add
Commit 60999ca4b4 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:54 +01:00
Anand Jain
d8367db30a btrfs: make btrfs_free_stale_devices() to match the path
btrfs_free_stale_devices() is updated to match the given device path
and delete it. (It searches only the list of unmounted devices.)
Also drop the comment about a different path being used for the same
device: since we will now have a CLI to clean up any device, that is
no longer a concern.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:52 +01:00
Anand Jain
0d34097f66 btrfs: rename btrfs_free_stale_devices() arg to skip_dev
No functional changes.
Rename btrfs_free_stale_devices() arg to skip_dev, so that it
reflects what that arg is for.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:50 +01:00
Anand Jain
522f1b45e4 btrfs: make btrfs_free_stale_devices() argument optional
This updates the btrfs_free_stale_devices() helper function to delete all
unmounted devices when arg is NULL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:48 +01:00
Anand Jain
38cf665d33 btrfs: make btrfs_free_stale_device() to iterate all stales
Let the list iterator iterate further, find other stale devices and
delete them. This is in preparation for adding support for stale
device cleanup requested from user land. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:47 +01:00
Anand Jain
a848b3e547 btrfs: no need to check for btrfs_fs_devices::seeding
There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 20:25:44 +01:00
Greg Kroah-Hartman
5d54f948aa sysfs: turn WARN() into pr_warn()
It's not good to crash the machine if panic_on_warn() is set just
because someone made a stupid mistake of trying to create a sysfs file
with the same name as an existing one.  This makes it a lot harder for
the automated testing tools to find the real bugs in the kernel.

So just print a warning out and dump the stack to get the attention of
the developer that they did something foolish.  Then keep on trucking,
as this should not be a fatal error at all.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-22 16:11:12 +01:00
Nikolay Borisov
b03ebd992f btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
No functional changes, just makes the code more readable
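
As a rough illustration, an open-coded alignment test and a macro-based
one compute the same thing. IS_ALIGNED_SKETCH below is a userspace
stand-in written for this example, not the kernel's definition, and the
helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for an IS_ALIGNED-style macro; 'a' must be a power of two. */
#define IS_ALIGNED_SKETCH(x, a)	(((x) & ((uint64_t)(a) - 1)) == 0)

/* Open-coded form: mask off the low bits by hand. */
static bool aligned_open_coded(uint64_t offset, uint32_t blocksize)
{
	return (offset & (blocksize - 1)) == 0;
}

/* Same test, with the intent spelled out by the macro name. */
static bool aligned_with_macro(uint64_t offset, uint32_t blocksize)
{
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	return IS_ALIGNED_SKETCH(offset, blocksize);
}
```

Both helpers return the same result for any power-of-two block size; the
second one only reads better.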

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
5f4791f4a6 Btrfs: noinline merge_extent_mapping
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
gets inlined by the compiler and cannot be probed.

This adds the noinline attribute to merge_extent_mapping().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
9a7e10e7ba Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
This is a subtle case, so in order to understand the problem, it'd be good
to know the content of existing and em when any error occurs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
cd77f4f836 Btrfs: extent map selftest: dio write vs dio read
This test case simulates the racy situation of dio write vs dio read,
and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
fd87526fad Btrfs: extent map selftest: buffered write vs dio read
This test case simulates the racy situation of buffered write vs dio
read, and checks whether btrfs_get_extent() would return -EEXIST.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
72b28077a2 Btrfs: add extent map selftests
We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases. These are caused by race conditions,
e.g. dio read vs dio write, which makes the problem very tricky to
reproduce.

This adds extent map selftests in order to simulate those racy situations.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
[ minor string adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
c04e61b5e4 Btrfs: move extent map specific code to extent_map.c
These helpers are extent map specific, move them to extent_map.c.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo
7b4df058b0 Btrfs: add helper for em merge logic
This is preparatory work for the following extent map selftest, which
runs tests against em merge logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
18e83ac75b Btrfs: fix unexpected EEXIST from btrfs_get_extent
This fixes a corner case that is caused by a race of dio write vs dio
read/write.

Here is how the race could happen.

Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).

t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K, 32K).

------------------------------------------------------
             t1                                t2
      btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
       -> btrfs_get_extent()              -> btrfs_get_extent()
           -> lookup_extent_mapping()
           -> add_extent_mapping()            -> lookup_extent_mapping()
              # load [0, 32K)
       -> btrfs_new_extent_direct()
           -> btrfs_drop_extent_cache()
              # split [0, 32K) and
              # drop [8K, 32K)
           -> add_extent_mapping()
              # add [8K, 32K)
                                              -> add_extent_mapping()
                                                 # handle -EEXIST when adding
                                                 # [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:

a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),

b) search_extent_mapping() then returns [0, 8k) as the existing em,
   even though start == existing->start, em is [0, 32k) so that
   extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,

c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
   (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,

d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
   which confuses applications.

Here I summarize all the possible situations:
1) start < existing->start

            +-----------+em+-----------+
+--prev---+ |     +-------------+      |
|         | |     |             |      |
+---------+ +     +---+existing++      ++
                +
                |
                +
             start

2) start == existing->start

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
            |
            |
            +
         start

3) start > existing->start && start < (existing->start + existing->len)

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
               |
               |
               +
             start

4) start >= (existing->start + existing->len)

+-----------+em+-----------+
|     +-------------+      | +--next---+
|     |             |      | |         |
+     +---+existing++      + +---------+
                      +
                      |
                      +
                   start

As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).
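
The rule in the paragraph above can be sketched as a small predicate.
struct em_sketch and start_in_existing() are hypothetical names used only
for this illustration; the real extent map structure carries more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced extent map: a half-open range [start, start + len). */
struct em_sketch {
	uint64_t start;
	uint64_t len;
};

/*
 * True when 'start' falls inside the existing em, front inclusive
 * (situations 2 and 3). In that case the existing em is returned as is.
 * False for situations 1 and 4, where merging with sibling ems is tried.
 */
static bool start_in_existing(const struct em_sketch *existing, uint64_t start)
{
	assert(existing != NULL);
	return start >= existing->start &&
	       start < existing->start + existing->len;
}
```

With the existing em [0, 8K) from the race above, a lookup at 0 or 4K
returns the existing em directly, so no zero-length [8K, 8K) insertion is
attempted.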

Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
a520a7e0b5 Btrfs: fix incorrect block_len in merge_extent_mapping
%block_len could be checked on deciding if two em are mergeable.

merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.

For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals em->len, hence this sets em->block_len to em->len.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Matthew Wilcox
3cbf26da5e btrfs: Remove unused readahead spinlock
The reada_lock in struct btrfs_device was only initialised, and not
actually used.  That's good because there's another lock also called
reada_lock in the btrfs_fs_info that was quite heavily used.  Remove
this one.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
7583d8d088 Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
more bios from other rbios and rbio->bio_list becomes non-empty,
in that case, these newly merged bios don't end properly.

Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.

It should only happen in error-out cases, the normal path of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merge before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.

Reported-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
44ac474def Btrfs: do not cache rbio pages if using raid6 recover
Since raid6 recover tries all possible combinations of failed stripes,

- when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and
  raid6_2data_recov(), it may change the in-memory content of failed
  stripes, if such a raid bio is cached, a later raid write rmw or recover
  can steal @stripe_pages from it instead of reading from disks, such that
  it carries the wrong content to do write rmw or recovery and ends up
  with corruption or recovery failures.

- when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached
  because the only failed stripe which contains @rbio->bio_pages gets
  modified, others remain the same so that their in-memory content is
  consistent with their on-disk content.

This adds a check to skip caching rbio if using raid6 recover.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
0198e5b707 Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all
Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it
will never be a BIO_CLONED bio, and since this is called by end_io
functions, bio->bi_iter.bi_size is zero, we mustn't use
bio_for_each_segment() as that is a no-op if bi_size is zero.

Fixes: 6592e58c6b ("Btrfs: fix write corruption due to bio cloning on raid5/6")
Cc: <stable@vger.kernel.org> # v4.12-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Su Yue
df6703e15c btrfs: correct wrong comment about magic number of index_cnt
There is no function named btrfs_get_inode_index_count.
Explanation for magic number index_cnt=2 in btrfs_new_inode() is
actually located in btrfs_set_inode_index_count().

So replace 'btrfs_get_inode_index_count' in the comment by
'btrfs_set_inode_index_count'.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Nikolay Borisov
d2560ebd23 btrfs: Make btrfs_inode_rsv_release static
It's not used outside of extent-tree so there is no reason to not be
static.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Anand Jain
1c94da9dd9 btrfs: cleanup btrfs_free_stale_device() usage
We call btrfs_free_stale_device() only when we alloc a new struct
btrfs_device (ret=1), so move it closer to where we alloc the new
device. Also drop the comments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
David Sterba
e2683fc9d2 btrfs: tree-check: reduce stack consumption in check_dir_item
I've noticed that the updated item checker stack consumption increased
dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker
for dir item")

tree-checker.c:check_leaf                    +552 (176 -> 728)

The array is 255 bytes long, dynamic allocation would slow down the
sanity checks so it's more reasonable to keep it on-stack. Moving the
variable to the scope of use reduces the stack usage again

tree-checker.c:check_leaf                    -264 (728 -> 464)

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Xiongfeng Wang
6670d4c2d9 btrfs: use correct string length in DEV_INFO ioctl
gcc-8 reports:

fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 1024 equals destination size [-Wstringop-truncation]

We need one less byte or call strlcpy() to make it a nul-terminated
string. This is done on the next line anyway, but we want to avoid the
warning.
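
A minimal sketch of the "one less byte" option, assuming a fixed-size
destination buffer. PATH_LEN_SKETCH and copy_path() are hypothetical
names, and the commit message does not say which of the two options the
patch finally used.

```c
#include <assert.h>
#include <string.h>

#define PATH_LEN_SKETCH 16	/* hypothetical destination size */

/* Copy src into dst, always leaving dst nul-terminated. */
static void copy_path(char dst[PATH_LEN_SKETCH], const char *src)
{
	assert(src != NULL);
	/* Bound is size - 1, so the last byte is never written by strncpy(). */
	strncpy(dst, src, PATH_LEN_SKETCH - 1);
	dst[PATH_LEN_SKETCH - 1] = '\0';
}
```

Passing the full buffer size as the bound is what triggers
-Wstringop-truncation, because strncpy() does not add a terminator when
the source fills the whole destination.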

Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Anand Jain
6f794e3c5c btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.
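
The warn-versus-fail difference can be sketched as a mask check. The flag
names and bit positions below are hypothetical and chosen only for
illustration, and -EINVAL is an assumed error code, not necessarily the
one the patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define SB_FLAG_KNOWN_A_SKETCH	(1ULL << 0)
#define SB_FLAG_KNOWN_B_SKETCH	(1ULL << 32)
#define SB_FLAG_SUPP_SKETCH	(SB_FLAG_KNOWN_A_SKETCH | SB_FLAG_KNOWN_B_SKETCH)

/* Returns 0 if every set bit is supported, -EINVAL otherwise. */
static int check_super_flags(uint64_t flags)
{
	if (flags & ~SB_FLAG_SUPP_SKETCH)
		return -EINVAL;	/* reject instead of only warning */
	return 0;
}
```

A caller on the mount path would propagate the non-zero return value and
abort the mount.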

[1]
 commit 319e4d0661
    btrfs: Enhance super validation check

Fixes: 319e4d0661 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Anand Jain
e2731e5588 btrfs: define SUPER_FLAG_METADUMP_V2
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in the kernel so that we know it's been used.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo
a6f93c71d4 Btrfs: avoid losing data raid profile when deleting a device
We've avoided data losing its raid profile when doing balance, but it
turns out that deleting a device could also result in the same
problem.

Say we have 3 disks, and they're created with '-d raid1' profile.

- We have chunk P (the only data chunk on the empty btrfs).

- Suppose that chunk P's two raid1 copies reside in disk A and disk B.

- Now, 'btrfs device remove disk B'
         btrfs_rm_device()
           -> btrfs_shrink_device()
              -> btrfs_relocate_chunk() # relocate any chunk on disk B
                                        # to other places.

- Chunk P will be removed and a new chunk will be created to hold
  those data, but as chunk P is the only one holding raid1 profile,
  after it goes away, the new chunk will be created as single profile
  which is our default profile.

This fixes the problem by creating an empty data chunk before
relocating the data chunk.

Metadata/System chunks are supposed to have non-zero bytes all the time
so their raid profile is preserved.

Reported-by: James Alandt <James.Alandt@wdc.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
81fdf6382b Btrfs: fix space leak after fallocate and zero range operations
If we do a buffered write after a zero range operation that has an
unaligned (with the filesystem's sector size) end which also falls within
an unwritten (prealloc) extent that is currently beyond the inode's
i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE,
we end up leaking data and metadata space. This happens because when
zeroing a range we call btrfs_truncate_block(), which does delalloc
(loads the page and partially zeroes its content), and in the buffered
write path we only clear existing delalloc space reservation for the
range we are writing into if that range starts at an offset smaller than
the inode's i_size, which makes sense since we can not have delalloc
extents beyond the i_size, only unwritten extents are allowed.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ xfs_io -f -c "falloc -k 428K 4K" /mnt/foobar
 $ xfs_io -c "fzero -k 0 430K" /mnt/foobar
 $ xfs_io -c "pwrite -S 0xaa 428K 4K" /mnt/foobar
 $ umount /mnt

After the unmount we get the metadata and data space leaks reported in
dmesg/syslog:

 [95794.602253] ------------[ cut here ]------------
 [95794.603322] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9561 btrfs_destroy_inode+0x4e/0x206 [btrfs]
 [95794.605167] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.613000] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.614448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.615972] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.617114] RIP: 0010:btrfs_destroy_inode+0x4e/0x206 [btrfs]
 [95794.618001] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
 [95794.618721] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.619645] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.620711] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.621932] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.623124] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.624188] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.625578] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.626522] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.627647] Call Trace:
 [95794.628128]  destroy_inode+0x3d/0x55
 [95794.628573]  evict+0x177/0x17e
 [95794.629010]  dispose_list+0x50/0x71
 [95794.629478]  evict_inodes+0x132/0x141
 [95794.630289]  generic_shutdown_super+0x3f/0x10b
 [95794.630864]  kill_anon_super+0x12/0x1c
 [95794.631383]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.631930]  deactivate_locked_super+0x30/0x68
 [95794.632539]  deactivate_super+0x36/0x39
 [95794.633200]  cleanup_mnt+0x49/0x67
 [95794.633818]  __cleanup_mnt+0x12/0x14
 [95794.634416]  task_work_run+0x82/0xa6
 [95794.634902]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.635525]  syscall_return_slowpath+0x18c/0x1af
 [95794.636122]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.636834] RIP: 0033:0x7fa678cb99a7
 [95794.637370] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.638672] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.639596] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.640703] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.641773] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.643150] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.644249] Code: ff 4c 8b a8 80 06 00 00 48 8b 87 c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 <0f> ff 83 bb 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00
 [95794.646929] ---[ end trace e95877675c6ec007 ]---
 [95794.647751] ------------[ cut here ]------------
 [95794.648509] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9562 btrfs_destroy_inode+0x59/0x206 [btrfs]
 [95794.649842] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.654659] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.655894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.657546] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.658433] RIP: 0010:btrfs_destroy_inode+0x59/0x206 [btrfs]
 [95794.659279] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202
 [95794.660054] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.660753] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.661513] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.662289] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.663393] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.664342] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.665673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.666593] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.667629] Call Trace:
 [95794.668065]  destroy_inode+0x3d/0x55
 [95794.668637]  evict+0x177/0x17e
 [95794.669179]  dispose_list+0x50/0x71
 [95794.669830]  evict_inodes+0x132/0x141
 [95794.670416]  generic_shutdown_super+0x3f/0x10b
 [95794.671103]  kill_anon_super+0x12/0x1c
 [95794.671786]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.672552]  deactivate_locked_super+0x30/0x68
 [95794.673393]  deactivate_super+0x36/0x39
 [95794.674107]  cleanup_mnt+0x49/0x67
 [95794.674706]  __cleanup_mnt+0x12/0x14
 [95794.675279]  task_work_run+0x82/0xa6
 [95794.675795]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.676507]  syscall_return_slowpath+0x18c/0x1af
 [95794.677275]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.678006] RIP: 0033:0x7fa678cb99a7
 [95794.678600] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.679739] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.680779] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.681837] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.682867] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.683891] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.684843] Code: c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 0f ff 83 bb 40 ff ff ff 00 74 02 <0f> ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff
 [95794.687156] ---[ end trace e95877675c6ec008 ]---
 [95794.687876] ------------[ cut here ]------------
 [95794.688579] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9565 btrfs_destroy_inode+0x7d/0x206 [btrfs]
 [95794.689735] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.695015] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.696396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.697956] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.698925] RIP: 0010:btrfs_destroy_inode+0x7d/0x206 [btrfs]
 [95794.699763] RSP: 0018:ffffc90001737d00 EFLAGS: 00010206
 [95794.700434] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c
 [95794.701445] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418
 [95794.702448] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000
 [95794.703557] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000
 [95794.704441] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0
 [95794.705270] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.706341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.707001] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.708030] Call Trace:
 [95794.708466]  destroy_inode+0x3d/0x55
 [95794.709071]  evict+0x177/0x17e
 [95794.709497]  dispose_list+0x50/0x71
 [95794.709973]  evict_inodes+0x132/0x141
 [95794.710564]  generic_shutdown_super+0x3f/0x10b
 [95794.711200]  kill_anon_super+0x12/0x1c
 [95794.711633]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.712139]  deactivate_locked_super+0x30/0x68
 [95794.712608]  deactivate_super+0x36/0x39
 [95794.713093]  cleanup_mnt+0x49/0x67
 [95794.713514]  __cleanup_mnt+0x12/0x14
 [95794.713933]  task_work_run+0x82/0xa6
 [95794.714543]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.715247]  syscall_return_slowpath+0x18c/0x1af
 [95794.715952]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.716653] RIP: 0033:0x7fa678cb99a7
 [95794.721100] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.722052] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.722856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.723698] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.724736] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.725928] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.726728] Code: 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff 00 74 02 0f ff 48 83 bb 30 ff ff ff 00 74 02 <0f> ff 48 83 bb 08 ff ff ff 00 74 02 0f ff 4d 85 e4 0f 84 52 01
 [95794.729203] ---[ end trace e95877675c6ec009 ]---
 [95794.841054] ------------[ cut here ]------------
 [95794.841829] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5831 btrfs_free_block_groups+0x235/0x36a [btrfs]
 [95794.843425] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.850658] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.852590] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.854752] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.855812] RIP: 0010:btrfs_free_block_groups+0x235/0x36a [btrfs]
 [95794.856811] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.857805] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.859014] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.860270] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.861525] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
 [95794.862700] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
 [95794.863810] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.865149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.866099] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.867198] Call Trace:
 [95794.867626]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.868188]  ? evict_inodes+0x132/0x141
 [95794.869037]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.870400]  generic_shutdown_super+0x6a/0x10b
 [95794.871262]  kill_anon_super+0x12/0x1c
 [95794.872046]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.872746]  deactivate_locked_super+0x30/0x68
 [95794.873687]  deactivate_super+0x36/0x39
 [95794.874639]  cleanup_mnt+0x49/0x67
 [95794.875504]  __cleanup_mnt+0x12/0x14
 [95794.876126]  task_work_run+0x82/0xa6
 [95794.876788]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.877777]  syscall_return_slowpath+0x18c/0x1af
 [95794.878381]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.878888] RIP: 0033:0x7fa678cb99a7
 [95794.879307] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.880204] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.881640] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.882690] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.883538] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.884562] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.885664] Code: 89 ef e8 07 ec 32 e1 e8 9d c0 ea e0 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 <0f> ff 48 83 bb 88 02 00 00 00 74 02 0f ff 48 83 bb d8 02 00 00
 [95794.887980] ---[ end trace e95877675c6ec00a ]---
 [95794.888739] ------------[ cut here ]------------
 [95794.889405] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5832 btrfs_free_block_groups+0x241/0x36a [btrfs]
 [95794.891020] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.897551] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.898509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.899685] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.900592] RIP: 0010:btrfs_free_block_groups+0x241/0x36a [btrfs]
 [95794.901387] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.902300] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.903260] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.904332] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.905300] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000
 [95794.906439] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100
 [95794.907459] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.908625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.909511] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.910630] Call Trace:
 [95794.911153]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.911837]  ? evict_inodes+0x132/0x141
 [95794.912344]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.912975]  generic_shutdown_super+0x6a/0x10b
 [95794.913788]  kill_anon_super+0x12/0x1c
 [95794.914424]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.915142]  deactivate_locked_super+0x30/0x68
 [95794.915831]  deactivate_super+0x36/0x39
 [95794.916433]  cleanup_mnt+0x49/0x67
 [95794.917045]  __cleanup_mnt+0x12/0x14
 [95794.917665]  task_work_run+0x82/0xa6
 [95794.918309]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.919021]  syscall_return_slowpath+0x18c/0x1af
 [95794.919722]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.920426] RIP: 0033:0x7fa678cb99a7
 [95794.921039] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.922303] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.923335] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.924364] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.925435] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.926533] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.927557] Code: 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 0f ff 48 83 bb 88 02 00 00 00 74 02 <0f> ff 48 83 bb d8 02 00 00 00 74 02 0f ff 48 83 bb e0 02 00 00
 [95794.930166] ---[ end trace e95877675c6ec00b ]---
 [95794.930961] ------------[ cut here ]------------
 [95794.931727] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.932729] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.938394] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.939842] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.941455] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.942336] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.943268] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.944127] RAX: ffff8802004fd0e8 RBX: ffff88006145c000 RCX: 0000000000000001
 [95794.945211] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff
 [95794.946316] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9
 [95794.947271] R10: ffffc90001737c80 R11: 00000000000337fd R12: ffff8802004fd0e8
 [95794.948219] R13: ffff88006145c0c0 R14: ffff88006145e598 R15: ffff88006145c100
 [95794.949193] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.950495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.951338] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95794.952361] Call Trace:
 [95794.952811]  close_ctree+0x1db/0x2b8 [btrfs]
 [95794.953522]  ? evict_inodes+0x132/0x141
 [95794.954543]  btrfs_put_super+0x15/0x17 [btrfs]
 [95794.955231]  generic_shutdown_super+0x6a/0x10b
 [95794.955916]  kill_anon_super+0x12/0x1c
 [95794.956414]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95794.956953]  deactivate_locked_super+0x30/0x68
 [95794.957635]  deactivate_super+0x36/0x39
 [95794.958256]  cleanup_mnt+0x49/0x67
 [95794.958701]  __cleanup_mnt+0x12/0x14
 [95794.959181]  task_work_run+0x82/0xa6
 [95794.959635]  prepare_exit_to_usermode+0xe1/0x10c
 [95794.960182]  syscall_return_slowpath+0x18c/0x1af
 [95794.960731]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95794.961438] RIP: 0033:0x7fa678cb99a7
 [95794.961990] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95794.963111] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95794.963975] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95794.964680] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95794.965763] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95794.966868] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95794.967800] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
 [95794.970629] ---[ end trace e95877675c6ec00c ]---
 [95794.971451] BTRFS info (device sdi): space_info 1 has 7680000 free, is not full
 [95794.972351] BTRFS info (device sdi): space_info total=8388608, used=704512, pinned=0, reserved=0, may_use=4096, readonly=0
 [95794.973595] ------------[ cut here ]------------
 [95794.974353] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.980163] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs]
 [95794.986461] CPU: 0 PID: 31496 Comm: umount Tainted: G        W       4.14.0-rc6-btrfs-next-54+ #1
 [95794.987591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [95794.988929] task: ffff880075aa0240 task.stack: ffffc90001734000
 [95794.989922] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs]
 [95794.990715] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206
 [95794.991431] RAX: ffff88020f6e70e8 RBX: ffff88006145c000 RCX: ffffffff8115a906
 [95794.992455] RDX: ffffffff8115a902 RSI: ffff880075aa0b40 RDI: ffff880075aa0b40
 [95794.993535] RBP: ffffc90001737d98 R08: 0000000000000020 R09: fffffffffffffff7
 [95794.994573] R10: 00000000ffffffc4 R11: ffff8800633b1bc0 R12: ffff88020f6e70e8
 [95794.996250] R13: 0000000000000038 R14: ffff88006145e598 R15: 0000000000000000
 [95794.997233] FS:  00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [95794.998592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [95794.999484] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0
 [95795.000542] Call Trace:
 [95795.001138]  close_ctree+0x1db/0x2b8 [btrfs]
 [95795.001885]  ? evict_inodes+0x132/0x141
 [95795.002407]  btrfs_put_super+0x15/0x17 [btrfs]
 [95795.003093]  generic_shutdown_super+0x6a/0x10b
 [95795.003720]  kill_anon_super+0x12/0x1c
 [95795.004353]  btrfs_kill_super+0x16/0x21 [btrfs]
 [95795.005095]  deactivate_locked_super+0x30/0x68
 [95795.005716]  deactivate_super+0x36/0x39
 [95795.006388]  cleanup_mnt+0x49/0x67
 [95795.006939]  __cleanup_mnt+0x12/0x14
 [95795.007512]  task_work_run+0x82/0xa6
 [95795.008124]  prepare_exit_to_usermode+0xe1/0x10c
 [95795.008994]  syscall_return_slowpath+0x18c/0x1af
 [95795.009831]  entry_SYSCALL_64_fastpath+0xab/0xad
 [95795.010610] RIP: 0033:0x7fa678cb99a7
 [95795.011193] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [95795.012327] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7
 [95795.013432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90
 [95795.014558] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015
 [95795.015577] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64
 [95795.016569] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160
 [95795.017662] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff
 [95795.020538] ---[ end trace e95877675c6ec00d ]---
 [95795.021259] BTRFS info (device sdi): space_info 4 has 1072775168 free, is not full
 [95795.022390] BTRFS info (device sdi): space_info total=1073741824, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536

Fix this by ensuring the zero range operation does not call
btrfs_truncate_block() if the corresponding extent is an unwritten one
(it's pointless anyway, since reading from an unwritten extent yields
zeroes).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
9f13ce743b Btrfs: fix missing inode i_size update after zero range operation
For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.

Example:

 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar
 $ xfs_io -c "falloc -k 428K 4K" /mnt/foobar
 $ xfs_io -c "fzero 0 430K" /mnt/foobar
 $ du --bytes /mnt/foobar
 438272	/mnt/foobar

The inode's i_size was left as 428Kb (438272 bytes) when it should have
been updated to 430Kb (440320 bytes).
Fix this by always updating the inode's i_size explicitly after zeroing
the range.
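
The "down aligned" arithmetic in the example above can be checked with a short sketch. It assumes a 4096-byte sector size, and the helper names round_down_u64() and prealloc_visible_end() are hypothetical illustrations, not btrfs functions:

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align; align must be a power of two. */
uint64_t round_down_u64(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/*
 * Hypothetical helper: the largest i_size that a sector-granular
 * preallocation ending at range_end could set on its own.
 */
uint64_t prealloc_visible_end(uint64_t range_end, uint64_t sectorsize)
{
	return round_down_u64(range_end, sectorsize);
}
```

With range_end = 430K (440320 bytes) and a 4096-byte sector, prealloc_visible_end() yields 438272 bytes, which is the stale 428K size that du reported.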

Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
94f450712a Btrfs: use cached state when dirtying pages during buffered write
During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Filipe Manana
f27451f229 Btrfs: add support for fallocate's zero range operation
This implements support for the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
cc54ff626a Btrfs: do not merge rbios if their fail stripe index are not identical
Since the fail stripe index in an rbio is used to decide which
reconstruction algorithm will be run, we cannot merge rbios if their
fail stripe indexes are different; otherwise, one of the two
reconstructions would fail.
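
A minimal sketch of the kind of merge check described above. The struct and its field names (operation, faila, failb) are simplified stand-ins chosen for illustration and are assumptions, not the kernel's actual definitions:

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for a raid bio. */
struct rbio_sketch {
	int operation;	/* kind of work this rbio performs */
	int faila;	/* index of the first failed stripe, -1 if none */
	int failb;	/* index of the second failed stripe, -1 if none */
};

/* Two rbios may only be merged if they would run the same reconstruction. */
bool rbio_sketch_can_merge(const struct rbio_sketch *last,
			   const struct rbio_sketch *cur)
{
	if (last->operation != cur->operation)
		return false;

	/* Different failed stripes imply different rebuild algorithms. */
	if (last->faila != cur->faila || last->failb != cur->failb)
		return false;

	return true;
}
```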

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
db34be19c4 Btrfs: remove redundant check in rbio_can_merge
Given the above
'
if (last->operation != cur->operation)
	return 0;
',
it's guaranteed that the two operations are the same.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
05a5c55dfc btrfs: minor style cleanups in btrfs_scan_one_device
Assign ret = -EINVAL where it is actually required.
Remove { } around single line if else code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
c1f32b7c1f btrfs: simplify mutex unlocking code in btrfs_commit_transaction
No functional change, just rearrange the mutex_unlock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ edit subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
cadbc0a067 btrfs: rename btrfs_device::scrub_device to scrub_ctx
btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
922ea8994a btrfs: collapse btrfs_handle_error() into __btrfs_handle_fs_error()
There is no consumer of btrfs_handle_error() other than
__btrfs_handle_fs_error(); furthermore, this function is quite small.
Merge it into its parent.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
61ecda6865 btrfs: remove check for BTRFS_FS_STATE_ERROR which we just set
__btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls
btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR
is set in btrfs_handle_error(). And there is no other user of
btrfs_handle_error() as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
8810f7517a Btrfs: make raid6 rebuild retry more
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriously, if the loss happens on some important internal btree
roots, it could refuse to mount.

This extends btrfs to do more retries and each retry fails only one
stripe.  Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.

The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
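
The extended retry scheme can be sketched roughly as follows. This is only an illustration of "each retry fails one additional stripe"; it leaves out the initial raid5-style attempt, and rebuild_fn, rebuild_with_retries() and demo_rebuild() are hypothetical names, not btrfs functions:

```c
#include <stdbool.h>

/*
 * Hypothetical callback: rebuild stripe `target` while also treating
 * stripe `extra_fail` as bad; returns true if the result passes its
 * checksum.
 */
typedef bool (*rebuild_fn)(int target, int extra_fail, void *ctx);

/*
 * Try every other stripe as the second failure, one per retry.
 * Returns the index of the extra stripe that made the rebuild succeed,
 * or -1 if no retry produced good content.
 */
int rebuild_with_retries(int num_stripes, int target,
			 rebuild_fn try_rebuild, void *ctx)
{
	for (int extra = 0; extra < num_stripes; extra++) {
		if (extra == target)
			continue;
		if (try_rebuild(target, extra, ctx))
			return extra;
	}
	return -1;
}

/* Stand-in for a real rebuild: succeeds only for the stripe in *ctx. */
bool demo_rebuild(int target, int extra_fail, void *ctx)
{
	(void)target;
	return extra_fail == *(int *)ctx;
}
```

In the worst case the loop runs once per remaining stripe, which matches the bound on the number of retries given above.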

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Liu Bo
762221f095 Btrfs: fix scrub to repair raid6 corruption
The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.

We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do exactly the same rebuild process.

[1]: https://patchwork.kernel.org/patch/10091755/

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
Anand Jain
6528b99d3d btrfs: factor btrfs_check_rw_degradable() to check given device
Update btrfs_check_rw_degradable() to check against the given device if
it's lost.

We can use this function to know if the volume is going to be in
degraded mode OR failed state, when the given device fails.  Which is
needed when we are handling the device failed state.

This is a preparatory patch that does not affect the flow as such.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ enhance comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:20 +01:00
David Sterba
e43bbe5e16 btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
d810a4be1a btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
5bedc48a8f btrfs: drop unused parameters from mount_subvol
Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol, the flags and data (ie. mount options)
have been already applied and we will not need them.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
e215772cd2 btrfs: cleanup unnecessary string dup in btrfs_parse_options()
Long ago, commit edf24abe51 ("btrfs: sanity mount option parsing and
early mount code") split the btrfs_parse_options() into two parts
(btrfs_parse_early_options() and btrfs_parse_options()). As a result,
btrfs_parse_options no longer gets called twice and is the last one to
parse mount option string. Therefore there is no need to dup it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Liu Bo
203e02d934 Btrfs: remove unused wait in btrfs_stripe_hash
In fact nobody is waiting on @wait's waitqueue, it can be safely
removed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
36f7894f66 btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio
The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
ffc9c8dd7d btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio
The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
ea057f6daf btrfs: Remove redundant bio_get/set from submit_dio_repair_bio
The bio that is passed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
32506af595 btrfs: Remove redundant bio_get/set calls in compressed read/write paths
bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
4271ecea64 btrfs: Improve btrfs_search_slot description
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
36243c9199 btrfs: heuristic: call get4bits directly
As it's a single instance and local to the file, we don't need to pass
it as an argument.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
7add17befc btrfs: heuristic: open code copy_call callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
23ae8c63aa btrfs: heuristic: open code get_num callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.
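
As a rough illustration of what open coding the digit extraction looks like, here is a minimal radix sort over 4-bit digits that calls get4bits() directly instead of going through a callback. It is a sketch under assumptions: it sorts uint32_t values in ascending order, and the signatures and names other than get4bits are made up for this example rather than taken from the btrfs heuristic code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS	4
#define RADIX_BUCKETS	(1u << RADIX_BITS)

/* Extract the 4-bit digit of num that starts at bit `shift`. */
uint8_t get4bits(uint32_t num, unsigned int shift)
{
	return (uint8_t)((num >> shift) & (RADIX_BUCKETS - 1));
}

/* LSD radix sort, ascending; tmp must have room for n elements. */
void radix_sort_u32(uint32_t *array, uint32_t *tmp, size_t n)
{
	for (unsigned int shift = 0; shift < 32; shift += RADIX_BITS) {
		size_t counters[RADIX_BUCKETS] = { 0 };
		size_t pos = 0;

		/* Count how many values fall into each digit bucket. */
		for (size_t i = 0; i < n; i++)
			counters[get4bits(array[i], shift)]++;

		/* Turn the counts into starting offsets. */
		for (size_t b = 0; b < RADIX_BUCKETS; b++) {
			size_t count = counters[b];

			counters[b] = pos;
			pos += count;
		}

		/* Stable scatter into tmp, then copy back. */
		for (size_t i = 0; i < n; i++)
			tmp[counters[get4bits(array[i], shift)]++] = array[i];

		memcpy(array, tmp, n * sizeof(*array));
	}
}
```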

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
78f6beacd0 btrfs: remove unused arg from parse_subvol_options()
Remove unused arg 'holder' from parse_subvol_options(), which was not
cleaned up in commit b99beb110e2d ("btrfs: split
parse_early_options() in two").

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
83085935cc btrfs: remove unused setup_root_args()
Since setup_root_args() is not used anymore, just remove it.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Misono, Tomohiro
d740760656 btrfs: split parse_early_options() in two
Now parse_early_options() is used by both btrfs_mount() and
btrfs_mount_root(). However, the former needs only the subvol related part
and the latter needs the rest.

Therefore extract the subvol related parts from parse_early_options() and
move them to a new parse function (parse_subvol_options()).

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Misono, Tomohiro
312c89fbca btrfs: cleanup btrfs_mount() using btrfs_mount_root()
Clean up btrfs_mount() by using btrfs_mount_root(). This avoids calling
btrfs_mount() twice in the mount path.

Old btrfs_mount() will do:
0. VFS layer calls vfs_kern_mount() with registered file_system_type
   (for btrfs, btrfs_fs_type). btrfs_mount() is called on the way.
1. btrfs_parse_early_options() parses the "subvolid=" mount option and sets
   subvol_objectid to its value. Otherwise, subvol_objectid keeps its
   initial value of 0
2. check whether subvol_objectid is 5. Assume this time the id is not 5;
   then btrfs_mount() returns by calling mount_subvol()
3. In mount_subvol(), the original mount options are modified to contain
   "subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with
   btrfs_fs_type and the new options
4. btrfs_mount() is called again
5. btrfs_parse_early_options() parses "subvolid=0" and sets subvol_objectid
   to 5 (instead of 0)
6. check whether subvol_objectid is 5. This time the id is 5 and
   mount_subvol() is not called. btrfs_mount() finishes mounting a root
7. (in mount_subvol()) using the return value of vfs_kern_mount(), it
   calls mount_subtree()
8. return the subvolume's dentry

Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount()
is the cause of this complication.
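
The re-entrancy in the old flow can be seen in a toy userspace model. This is a sketch under stated assumptions, not kernel code: every toy_* name is hypothetical, mount options are reduced to an optional numeric subvolid, and "mounting" only returns the id of the subvolume reached.

```c
#include <stdint.h>

#define TOY_FS_TREE_OBJECTID 5ULL /* id of the top-level tree, per step 2 */

struct toy_opts {
	int has_subvolid;  /* was "subvolid=" given? */
	uint64_t subvolid;
};

static int toy_mount_calls; /* how often the mount callback ran */

/* Steps 1 and 5: no option yields 0, "subvolid=0" yields 5. */
static uint64_t toy_parse_early_options(const struct toy_opts *opts)
{
	if (!opts->has_subvolid)
		return 0;
	return opts->subvolid == 0 ? TOY_FS_TREE_OBJECTID : opts->subvolid;
}

static uint64_t toy_btrfs_mount(const struct toy_opts *opts);

/* Steps 0 and 3: only one callback is registered, so it is always used. */
static uint64_t toy_vfs_kern_mount(const struct toy_opts *opts)
{
	return toy_btrfs_mount(opts);
}

/* Steps 3, 7 and 8: mount the root again, then descend to the subvolume. */
static uint64_t toy_mount_subvol(uint64_t subvol_objectid)
{
	struct toy_opts root_opts = { 1, 0 }; /* rewritten to "subvolid=0" */
	uint64_t root = toy_vfs_kern_mount(&root_opts);

	(void)root; /* a real lookup would start from the mounted root */
	return subvol_objectid;
}

/* Steps 2 and 6: the same function handles both passes. */
static uint64_t toy_btrfs_mount(const struct toy_opts *opts)
{
	uint64_t id = toy_parse_early_options(opts);

	toy_mount_calls++;
	if (id != TOY_FS_TREE_OBJECTID)
		return toy_mount_subvol(id);
	return TOY_FS_TREE_OBJECTID;
}
```

Mounting with subvolid=256 through toy_vfs_kern_mount() leaves toy_mount_calls at 2: one pass for the subvolume request and one nested pass for the root.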

Instead, new btrfs_mount() will do:
1. parse subvol id related options for later use in mount_subvol()
2. mount device's root by calling vfs_kern_mount() with
   btrfs_root_fs_type, which is not registered to VFS by
   register_filesystem(). As a result, btrfs_mount_root() is called
3. return by calling mount_subvol()

The code of 2. is moved from the first part of mount_subvol().
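
A toy userspace model of the new flow, again a sketch with hypothetical toy_* names and options reduced to an optional numeric subvolid: the root mount has its own entry point, so neither callback runs more than once.

```c
#include <stdint.h>

#define TOY_FS_TREE_OBJECTID 5ULL /* id of the top-level tree */

struct toy_opts {
	int has_subvolid;  /* was "subvolid=" given? */
	uint64_t subvolid;
};

static int toy_mount_calls;      /* runs of the registered callback */
static int toy_mount_root_calls; /* runs of the root-only callback */

/* Step 1: keep only the subvolume request; 0 means none was given. */
static uint64_t toy_parse_subvol_options(const struct toy_opts *opts)
{
	return opts->has_subvolid ? opts->subvolid : 0;
}

/* Step 2: stands in for vfs_kern_mount() with the unregistered type. */
static uint64_t toy_btrfs_mount_root(void)
{
	toy_mount_root_calls++;
	return TOY_FS_TREE_OBJECTID;
}

/* Step 3: descend from the mounted root to the requested subvolume. */
static uint64_t toy_mount_subvol(uint64_t subvol_objectid, uint64_t root)
{
	/* toy simplification: with no request, stay at the root */
	return subvol_objectid ? subvol_objectid : root;
}

static uint64_t toy_btrfs_mount(const struct toy_opts *opts)
{
	uint64_t id = toy_parse_subvol_options(opts);
	uint64_t root;

	toy_mount_calls++;
	root = toy_btrfs_mount_root();
	return toy_mount_subvol(id, root);
}
```

Mounting with subvolid=256 now leaves both counters at 1, and no mount options need to be rewritten along the way.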

The semantics of the device holder change from btrfs_fs_type to
btrfs_root_fs_type, which has to be used in all contexts. Otherwise we'd
get wrong results when mount and dev scan do not check the same
thing. (This was found independently and the fix is folded into this
patch.)

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the btrfs_control_ioctl fixup, extend the comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00
Misono, Tomohiro
72fa39f5c7 btrfs: add btrfs_mount_root() and new file_system_type
Add btrfs_mount_root() and a new file_system_type in preparation for the
cleanup of btrfs_mount(). The code path is not changed yet.

btrfs_mount_root() is almost the same as the current btrfs_mount(), but
doesn't have the subvolume related part.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:18 +01:00