linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 08:55:31 +07:00

Author	SHA1	Message	Date
Yan, Zheng	8d8f371c83	ceph: cleanup traceless reply handling for rename ceph_fill_trace() already calls ceph_invalidate_dir_request() for traceless reply. No need to duplicate the code in ceph_rename(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-01-29 18:36:06 +01:00
Yan, Zheng	87c91a965a	ceph: voluntarily drop Fx cap for readdir request MDS need to rdlock directory inode's filelock when handling readdir request. Voluntarily dropping CEPH_CAP_AUTH_EXCL avoids a cap revoke message. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-01-29 18:36:05 +01:00
Yan, Zheng	be70489eff	ceph: properly drop caps for setattr request For CEPH_SETATTR_ATIME, MDS needs to xlock filelock, Fsxrw caps are not allowed for xlocked filelock. For CEPH_SETATTR_SIZE request that truncates file to smaller size, MDS needs to xlock filelock, Fsxrw caps are not allowed for xlocked filelock. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-01-29 18:36:05 +01:00
Yan, Zheng	d19a0b5401	ceph: voluntarily drop Lx cap for link/rename requests MDS need to xlock inode's linklock when handling link/rename requests. Voluntarily dropping CEPH_CAP_AUTH_EXCL avoids a cap revoke message. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-01-29 18:36:04 +01:00
Yan, Zheng	222b7f90ba	ceph: voluntarily drop Ax cap for requests that create new inode MDS need to rdlock directory inode's authlock when handling these requests. Voluntarily dropping CEPH_CAP_AUTH_EXCL avoids a cap revoke message. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-01-29 18:36:04 +01:00
Bob Peterson	2eb5909dee	GFS2: Don't try to end a non-existent transaction in unlink Before this patch, if function gfs2_unlink failed to get a valid transaction (for example, not enough journal blocks) it would go to label out_end_trans which did gfs2_trans_end. But if the trans_begin failed, there's no transaction to end, and trying to do so results in: kernel BUG at fs/gfs2/trans.c:117! This patch changes the goto so that it does not try to end a non-existent transaction. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-01-29 10:00:23 -07:00
Christoph Hellwig	1e369b0e19	xfs: remove experimental tag for reflinks But reject reflink + DAX file systems for now until the code to support reflinks on DAX is actually implemented. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: port to 4.16] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:24 -08:00
Darrick J. Wong	6d8a45ce29	xfs: don't screw up direct writes when freesp is fragmented xfs_bmap_btalloc is given a range of file offset blocks that must be allocated to some data/attr/cow fork. If the fork has an extent size hint associated with it, the request will be enlarged on both ends to try to satisfy the alignment hint. If free space is fragmentated, sometimes we can allocate some blocks but not enough to fulfill any of the requested range. Since bmapi_allocate always trims the new extent mapping to match the originally requested range, this results in bmapi_write returning zero and no mapping. The consequences of this vary -- buffered writes will simply re-call bmapi_write until it can satisfy at least one block from the original request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC through the dio mechanism out to userspace with the weird result that writes fail even when we have enough space because the ENOSPC return overrides any partial write status. For direct CoW writes the situation was disastrous because nobody notices us returning an invalid zero-length wrong-offset mapping to iomap and the write goes off into space. Therefore, if free space is so fragmented that we managed to allocate some space but not enough to map into even a single block of the original allocation request range, we should break the alignment hint in order to guarantee at least some forward progress for the direct write. If we return a short allocation to iomap_apply it'll call back about the remaining blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:24 -08:00
Darrick J. Wong	9f37bd11b4	xfs: check reflink allocation mappings There's a really bad bug in xfs_reflink_allocate_cow -- if bmapi_write can return a zero error code but no mappings. This happens if there's an extent size hint (which causes allocation requests to be rounded to extsz granularity internally), but there wasn't a big enough chunk of free space to start filling at the extsz granularity and fill even one block of the range that we actually requested. In any case, if we got no mappings we can't possibly do anything useful with the contents of imap, so we must bail out with ENOSPC here. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:24 -08:00
Darrick J. Wong	0c6dda7a1c	iomap: warn on zero-length mappings Don't let the iomap callback get away with feeding us a garbage zero length mapping -- there was a bug in xfs that resulted in those leaking out to hilarious effect. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:24 -08:00
Darrick J. Wong	4b4c1326fd	xfs: treat CoW fork operations as delalloc for quota accounting Since the CoW fork only exists in memory, it is incorrect to update the on-disk quota block counts when we modify the CoW fork. Unlike the data fork, even real extents in the CoW fork are only delalloc-style reservations (on-disk they're owned by the refcountbt) so they must not be tracked in the on disk quota info. Ensure the i_delayed_blks accounting reflects this too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	01c2e13dca	xfs: only grab shared inode locks for source file during reflink Reflink and dedupe operations remap blocks from a source file into a destination file. The destination file needs exclusive locks on all levels because we're updating its block map, but the source file isn't undergoing any block map changes so we can use a shared lock. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	7c2d238ac6	xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes Refactor xfs_lock_two_inodes to take separate locking modes for each inode. Specifically, this enables us to take a SHARED lock on one inode and an EXCL lock on the other. The lock class (MMAPLOCK/ILOCK) must be the same for each inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	1364b1d4b5	xfs: reflink should break pnfs leases before sharing blocks Before we share blocks between files, we need to break the pnfs leases on the layout before we start slicing and dicing the block map. The structure of this function sets us up for the lock contention reduction in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	c47b74fb2d	xfs: don't clobber inobt/finobt cursors when xref with rmap Even if we can't use the inobt/finobt cursors to count the number of inode btree blocks, we are never allowed to clobber the cursor of the btree being checked, so don't do this. Found by fuzzing level = ones in xfs/364. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	70c57dcd60	xfs: skip CoW writes past EOF when writeback races with truncate Every so often we blow the ASSERT(type != XFS_IO_COW) in xfs_map_blocks when running fsstress, as we do in generic/269. The cause of this is writeback racing with truncate -- writeback doesn't take the iolock, so truncate can sneak in to decrease i_size and truncate page cache while writeback is gathering buffer heads to schedule writeout. If we hit this race on a block that has a CoW mapping, we'll get a valid imap from the CoW fork but the reduced i_size trims the mapping to zero length (which makes it invalid), so we call xfs_map_blocks to try again. This doesn't do much anyway, since any mapping we get out of that will also be invalid, so we might as well skip the assert and just stop. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:23 -08:00
Amir Goldstein	acd1d71598	xfs: preserve i_rdev when recycling a reclaimable inode Commit `66f364649d` ("xfs: remove if_rdev") moved storing of rdev value for special inodes to VFS inodes, but forgot to preserve the value of i_rdev when recycling a reclaimable xfs_inode. This was detected by xfstest overlay/017 with inodex=on mount option and xfs base fs. The test does a lookup of overlay chardev and blockdev right after drop caches. Overlayfs inodes hold a reference on underlying xfs inodes when mount option index=on is configured. If drop caches reclaim xfs inodes, before it relclaims overlayfs inodes, that can sometimes leave a reclaimable xfs inode and that test hits that case quite often. When that happens, the xfs inode cache remains broken (zere i_rdev) until the next cycle mount or drop caches. Fixes: `66f364649d` ("xfs: remove if_rdev") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	751f3767c2	xfs: refactor accounting updates out of xfs_bmap_btalloc Move all the inode and quota accounting updates out of xfs_bmap_btalloc in preparation for fixing some quota accounting problems with copy on write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-01-29 07:27:23 -08:00
Darrick J. Wong	22431bf3df	xfs: refactor inode verifier corruption error printing Refactor inode verifier error reporting into a non-libxfs function so that we aren't encoding the message format in libxfs. This also changes the kernel dmesg output to resemble buffer verifier errors more closely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:22 -08:00
Darrick J. Wong	67a3f6d014	xfs: make tracepoint inode number format consistent Fix all the inode number formats to be consistently (0x%llx) in all trace point definitions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:22 -08:00
Darrick J. Wong	beaae8cd58	xfs: always zero di_flags2 when we free the inode Always zero the di_flags2 field when we free the inode so that we never end up with an on-disk record for an unallocated inode that also has the reflink iflag set. This is in keeping with the general principle that only files can have the reflink iflag set, even though we'll zero out di_flags2 if we ever reallocate the inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:22 -08:00
Darrick J. Wong	09ac862397	xfs: call xfs_qm_dqattach before performing reflink operations Ensure that we've attached all the necessary dquots before performing reflink operations so that quota accounting is accurate. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2018-01-29 07:27:22 -08:00
Shan Hai	6ca30729c2	xfs: bmap code cleanup Remove the extent size hint and realtime inode relevant code from the xfs_bmapi_reserve_delalloc since it is not called on the inode with extent size hint set or on a realtime inode. Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:22 -08:00
Carlos Maiolino	643c8c05e7	Use list_head infra-structure for buffer's log items list Now that buffer's b_fspriv has been split, just replace the current singly linked list of xfs_log_items, by the list_head infrastructure. Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers(), there is no need for this argument, once the log items can be walked through the list_head in the buffer. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor style cleanups] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:22 -08:00
Carlos Maiolino	fb1755a645	Split buffer's b_fspriv field By splitting the b_fspriv field into two different fields (b_log_item and b_li_list). It's possible to get rid of an old ABI workaround, by using the new b_log_item field to store xfs_buf_log_item separated from the log items attached to the buffer, which will be linked in the new b_li_list field. This way, there is no more need to reorder the log items list to place the buf_log_item at the beginning of the list, simplifying a bit the logic to handle buffer IO. This also opens the possibility to change buffer's log items list into a proper list_head. b_log_item field is still defined as a void *, because it is still used by the log buffers to store xlog_in_core structures, and there is no need to add an extra field on xfs_buf just for xlog_in_core. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor style changes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:22 -08:00
Carlos Maiolino	70a2065533	Get rid of xfs_buf_log_item_t typedef Take advantage of the rework on xfs_buf log items list, to get rid of ths typedef for xfs_buf_log_item. This patch also fix some indentation alignment issues found along the way. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-01-29 07:27:22 -08:00
David S. Miller	3e3ab9ccca	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-29 10:15:51 -05:00
Jeff Layton	3a8c7231d5	btrfs: only dirty the inode in btrfs_update_time if something was changed At this point, we know that "now" and the file times may differ, and we suspect that the i_version has been flagged to be bumped. Attempt to bump the i_version, and only mark the inode dirty if that actually occurred or if one of the times was updated. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	d17260fd5f	xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing If XFS_ILOG_CORE is already set then go ahead and increment it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	e38cf302b2	fs: only set S_VERSION when updating times if necessary We only really need to update i_version if someone has queried for it since we last incremented it. By doing that, we can avoid having to update the inode if the times haven't changed. If the times have changed, then we go ahead and forcibly increment the counter, under the assumption that we'll be going to the storage anyway, and the increment itself is relatively cheap. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2018-01-29 06:42:21 -05:00
Jeff Layton	f0e2828062	xfs: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Chinner <dchinner@redhat.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	bb8c2d66bc	ufs: use new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	cc56c33e78	ocfs2: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2018-01-29 06:42:21 -05:00
Jeff Layton	1f15a550f5	nfsd: convert to new i_version API Mostly just making sure we use the "get" wrappers so we know when it is being fetched for later use. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	1eb5d98f16	nfs: convert to new i_version API For NFS, we just use the "raw" API since the i_version is mostly managed by the server. The exception there is when the client holds a write delegation, but we only need to bump it once there anyway to handle CB_GETATTR. Tested-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:21 -05:00
Jeff Layton	ee73f9a52a	ext4: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Theodore Ts'o <tytso@mit.edu>	2018-01-29 06:42:21 -05:00
Jeff Layton	e1d747d9b6	ext2: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2018-01-29 06:42:20 -05:00
Jeff Layton	317bc94780	exofs: switch to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:20 -05:00
Jeff Layton	c7f88c4e78	btrfs: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: David Sterba <dsterba@suse.com>	2018-01-29 06:42:20 -05:00
Jeff Layton	a01179e6eb	afs: convert to new i_version API For AFS, it's generally treated as an opaque value, so we use the *_raw variants of the API here. Note that AFS has quite a different definition for this counter. AFS only increments it on changes to the data to the data in regular files and contents of the directories. Inode metadata changes do not result in a version increment. We'll need to reconcile that somehow if we ever want to present this to userspace via statx. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:20 -05:00
Jeff Layton	9dffe569d9	affs: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:20 -05:00
Jeff Layton	2489dbabea	fat: convert to new i_version API Signed-off-by: Jeff Layton <jlayton@redhat.com>	2018-01-29 06:42:20 -05:00
Jeff Layton	ae5e165d85	fs: new API for handling inode->i_version Add a documentation blob that explains what the i_version field is, how it is expected to work, and how it is currently implemented by various filesystems. We already have inode_inc_iversion. Add several other functions for manipulating and accessing the i_version counter. For now, the implementation is trivial and basically works the way that all of the open-coded i_version accesses work today. Future patches will convert existing users of i_version to use the new API, and then convert the backend implementation to do things more efficiently. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2018-01-29 06:41:30 -05:00
Trond Myklebust	e231c6879c	NFS: Fix a race between mmap() and O_DIRECT When locking the file in order to do O_DIRECT on it, we must unmap any mmapped ranges on the pagecache so that we can flush out the dirty data. Fixes: `a5864c999d` ("NFS: Do not serialise O_DIRECT reads and writes") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.8+	2018-01-28 22:00:15 -05:00
Achilles Gaikwad	36c7ce4a17	fs/cifs/cifsacl.c Fixes typo in a comment Signed-off-by: Achilles Gaikwad <achillesgaikwad@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-28 09:19:45 -06:00
Trond Myklebust	128159f292	NFS: Remove a redundant call to unmap_mapping_range() We don't need to call unmap_mapping_range() prior to calling nfs_sync_mapping(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-01-28 09:35:54 -05:00
Steve French	ab2c643309	update internal version number for cifs.ko To version 2.11 Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-26 17:03:01 -06:00
Andrés Souto	cd1aca29fa	cifs: add .splice_write add splice_write support in cifs vfs using iter_file_splice_write Signed-off-by: Andrés Souto <kai670@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-26 17:03:01 -06:00
Aurelien Aptel	4a1360d01d	CIFS: document tcon/ses/server refcount dance Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-26 17:03:00 -06:00
Steve French	6b314714ff	move a few externs to smbdirect.h to eliminate warning Quiet minor sparse warnings in new SMB3 rdma patch series ("symbol was not declared ...") by moving these externs to smbdirect.h Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-26 17:03:00 -06:00
Aurelien Aptel	97f4b7276b	CIFS: zero sensitive data when freeing also replaces memset()+kfree() by kzfree(). Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Cc: <stable@vger.kernel.org>	2018-01-26 17:03:00 -06:00
Steve French	2026b06e9c	Cleanup some minor endian issues in smb3 rdma Minor cleanup of some sparse warnings (including a few misc endian fixes for the new smb3 rdma code) Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-26 17:03:00 -06:00
Aurelien Aptel	02cf5905e3	CIFS: dump IPC tcon in debug proc file dump it as first share with an "IPC: " prefix. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-26 17:03:00 -06:00
Aurelien Aptel	63a83b861c	CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl Since IPC now has a tcon object, the caller can just pass it. This allows domain-based DFS requests to work with smb2+. Link: https://bugzilla.samba.org/show_bug.cgi?id=12917 Fixes: `9d49640a21` ("CIFS: implement get_dfs_refer for SMB2+") Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-26 17:03:00 -06:00
Aurelien Aptel	b327a717e5	CIFS: make IPC a regular tcon * Remove ses->ipc_tid. * Make IPC$ regular tcon. * Add a direct pointer to it in ses->tcon_ipc. * Distinguish PIPE tcon from IPC tcon by adding a tcon->pipe flag. All IPC tcons are pipes but not all pipes are IPC. * All TreeConnect functions now cannot take a NULL tcon object. The IPC tcon has the same lifetime as the session it belongs to. It is created when the session is created and destroyed when the session is destroyed. Since no mounts directly refer to the IPC tcon, its refcount should always be set to initialisation value (1). Thus we make sure cifs_put_tcon() skips it. If the mount request resulting in a new session being created requires encryption, try to require it too for IPC. * set SERVER_NAME_LENGTH to serverName actual size The maximum length of an ipv6 string representation is defined in INET6_ADDRSTRLEN as 45+1 for null but lets keep what we know works. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-26 17:03:00 -06:00
Martin Brandenburg	6793f1c450	orangefs: fix deadlock; do not write i_size in read_iter After do_readv_writev, the inode cache is invalidated anyway, so i_size will never be read. It will be fetched from the server which will also know about updates from other machines. Fixes deadlock on 32-bit SMP. See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2 Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mike Marshall <hubcap@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-25 17:26:24 -08:00
Jake Daryll Obina	5bdd0c6f89	jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode() can get called twice in the error handling path, the first call in jffs2_iget() itself and the second through iget_failed(). This can result to a use-after-free error in the second jffs2_do_clear_inode() call, such as shown by the oops below wherein the second jffs2_do_clear_inode() call was trying to free node fragments that were already freed in the first jffs2_do_clear_inode() call. [ 78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c [ 78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b [ 78.185871] pgd = ffffffc03a567000 [ 78.188794] [6b6b6b6b6b6b6b7b] pgd=0000000000000000, pud=0000000000000000 [ 78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP ... [ 78.513147] PC is at rb_first_postorder+0xc/0x28 [ 78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2] [ 78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105 [ 78.526757] sp : ffffff800cea38f0 [ 78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80 [ 78.532754] x27: 0000000000000000 x26: ffffff800cea3c70 [ 78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d [ 78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8 [ 78.544760] x21: 0000000000000000 x20: 0000000000000000 [ 78.548762] x19: ffffffc037d48910 x18: ffffff800891a588 [ 78.552764] x17: 0000000000000800 x16: 0000000000000c00 [ 78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f [ 78.560767] x13: 6461657220726f66 x12: 2064656c69616620 [ 78.564769] x11: 435243203a6c616e x10: 7265746e695f6564 [ 78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038 [ 78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008 [ 78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000 [ 78.580777] x3 : 0000000000000000 x2 : 000000403755e000 [ 78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b ... [ 79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28 [ 79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2] [ 79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2] [ 79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168 [ 79.056835] [<ffffff8008202650>] iput+0x1c0/0x200 [ 79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c [ 79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2] [ 79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2] [ 79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190 [ 79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c [ 79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108 [ 79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100 [ 79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c [ 79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114 [ 79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98 [ 79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4 [ 79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821) The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since iget_failed() will eventually call jffs2_do_clear_inode() if needed, so just remove it. Fixes: `5451f79f5f` ("iget: stop JFFS2 from using iget() and read_inode()") Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Jake Daryll Obina <jake.obina@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-25 19:34:30 -05:00
Alexey Dobriyan	b35d786b67	dcache: delete unused d_hash_mask Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-25 19:34:30 -05:00
Alexey Dobriyan	854d3e6343	dcache: subtract d_hash_shift from 32 in advance Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-25 19:34:29 -05:00
Eric Biggers	01950a349e	fs/buffer.c: fold init_buffer() into init_page_buffers() Since commit `e76004093d` ("fs/buffer.c: remove unnecessary init operation after allocating buffer_head"), there are no callers of init_buffer() outside of init_page_buffers(). So just fold it into init_page_buffers(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-25 19:34:28 -05:00
Eric Biggers	4bfd054ae1	fs: fold __inode_permission() into inode_permission() Since commit `9c630ebefe` ("ovl: simplify permission checking"), overlayfs doesn't call __inode_permission() anymore, which leaves no users other than inode_permission(). So just fold it back into inode_permission(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-25 19:34:28 -05:00
Chao Yu	1c1d35df71	f2fs: support inode creation time This patch adds creation time field in inode layout to support showing kstat.btime in ->statx. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-25 14:10:39 -08:00
Benjamin Coddington	f34462c3c8	pnfs/blocklayout: Ensure disk address in block device map It's possible that the device map is smaller than the offset into the device for the I/O we're adding. Add a check for it and bail out, otherwise we risk botching the bio calculations that follow. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trondmy@gmail.com>	2018-01-25 16:42:35 -05:00
Benjamin Coddington	b39604755c	pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors Fixup the field types to match their use. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trondmy@gmail.com>	2018-01-25 16:42:35 -05:00
Yunlei He	068c3cd858	f2fs: rebuild sit page from sit info in mem This patch rebuild sit page from sit info in mem instead of issue a read io. I test this method and the result is as below: Pre: mmc_perf_test-12061 [001] ...1 976.819992: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [001] ...1 976.856446: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [003] ...1 998.976946: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [003] ...1 999.023269: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [003] ...1 1022.060772: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [003] ...1 1022.111034: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [002] ...1 1070.127643: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [003] ...1 1070.187352: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [003] ...1 1095.942124: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [003] ...1 1095.995975: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [003] ...1 1122.535091: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [003] ...1 1122.586521: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [001] ...1 1147.897487: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [001] ...1 1147.959438: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [003] ...1 1177.926951: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [002] ...1 1177.976823: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-12061 [002] ...1 1204.176087: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-12061 [002] ...1 1204.239046: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit Some sit flush consume more than 50ms. Now: mmc_perf_test-2187 [007] ...1 196.840684: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [007] ...1 196.841258: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [007] ...1 219.430582: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [007] ...1 219.431144: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [002] ...1 243.638678: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [000] ...1 243.638980: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [002] ...1 265.392180: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [002] ...1 265.392245: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [000] ...1 290.309051: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [000] ...1 290.309116: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [003] ...1 317.144209: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [003] ...1 317.145913: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [005] ...1 343.224954: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [005] ...1 343.225574: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [000] ...1 370.239846: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [000] ...1 370.241138: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [001] ...1 397.029043: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [001] ...1 397.030750: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit mmc_perf_test-2187 [003] ...1 425.386377: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = start flush sit mmc_perf_test-2187 [003] ...1 425.387735: f2fs_write_checkpoint: dev = (259,44), checkpoint for Sync, state = end flush sit Most sit flush consume no more than 1ms. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-25 10:44:34 -08:00
Chao Yu	3b60d802d9	f2fs: stop issuing discard if fs is readonly If filesystem is readonly, stop to issue discard in daemon. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-25 10:40:10 -08:00
Chao Yu	6819b884e0	f2fs: clean up duplicated assignment in init_discard_policy Remove duplicated codes of assignment for .max_requests and .io_aware_gran. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-25 10:40:01 -08:00
Chao Yu	2882d34310	f2fs: use GFP_F2FS_ZERO for cleanup Clean up codes with GFP_F2FS_ZERO, no logic changes. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-25 10:39:49 -08:00
Bob Peterson	4519eaad72	GFS2: Fix minor comment typo Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-01-25 10:18:06 -07:00
Linus Torvalds	525273fb2e	for-4.15 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlppIo8ACgkQxWXV+ddt WDuu9A//e29aPiicPrtIovNfVCzev774UddFBhIiEqyfbO0QvHk7Ld0d9+htR38W eAwxJG3gH4VS9yOFAyk/LrsjpQZYzP32wjGoIXQlz8J1SzxALlILLmjpYkZWHwZX UqAomYDnRjFu4jsHZ/AAyUNngYER8aeH+aO1PudppBiSCIVpk1TAVRnkt04gCJ7e CigFgMdRc8pT5P0T6Rsv2+W/yJC7sU0BgDIBjmUcnduqEAxRYl0zsJzpP0IYPPRa cAnpyu/ApK/m9mlLWi0SfUyNvePFWNxfA2QDBO5G6FwM6C2f5x/zBGfu29wAupVJ czdOSR5uqCd6WGZHfvaQ4cgQ69AE4lk68zijHRNESPa7tVZ9mQt713h2BqAX6Fus z+y45ti9CrFoOhaoXCSjUbpJZQ7YKWrat44Qi8pJaKGxqiqXpT8oVcBfjofu+Drn vnDrU+7WvST5TxcWw/JBWaFfft1dkbdKX+EYZW8sVJAmA/e0+aIfElsTuvdUQxuA aLnsyvsYQHVcA/mArD12sScQ13DDejR4ija8bP5tpmodxquUouwEAtKPoFnXVT6k 8hbb6qPLsiBerRMSSuaNaFF/ESi36bsizH3O99bYyyxlR8bXi7irHZY9VPWc1OK3 4twoDO0RhCxDHPY5mrIcKdlp9HiiUBPGos3BCaOAg8rKHd/QP7Q= =EOAu -----END PGP SIGNATURE----- Merge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "It's been reported recently that readdir can list stale entries under some conditions. Fix it." * tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix stale entries in readdir	2018-01-25 09:03:10 -08:00
Colin Ian King	37e12f5551	cifs: remove redundant duplicated assignment of pointer 'node' Node is assigned twice to rb_first(root), first during declaration time and second after a taking a spin lock, so we have a duplicated assignment. Remove the first assignment because it is redundant and also not protected by the spin lock. Cleans up clang warning: fs/cifs/connect.c:4435:18: warning: Value stored to 'node' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Arnd Bergmann	e36c048a9b	CIFS: SMBD: work around gcc -Wmaybe-uninitialized warning GCC versions from 4.9 to 6.3 produce a false-positive warning when dealing with a conditional spin_lock_irqsave(): fs/cifs/smbdirect.c: In function 'smbd_recv_buf': include/linux/spinlock.h:260:3: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized] This function calls some sleeping interfaces, so it is clear that it does not get called with interrupts disabled and there is no need to save the irq state before taking the spinlock. This lets us remove the variable, which makes the function slightly more efficient and avoids the warning. A further cleanup could do the same change for other functions in this file, but I did not want to take this too far for now. Fixes: ac69f66e54ca ("CIFS: SMBD: Implement function to receive data via RDMA receive") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:07 -06:00
Daniel N Pettersson	9aca7e4544	cifs: Fix autonegotiate security settings mismatch Autonegotiation gives a security settings mismatch error if the SMB server selects an SMBv3 dialect that isn't SMB3.02. The exact error is "protocol revalidation - security settings mismatch". This can be tested using Samba v4.2 or by setting the global Samba setting max protocol = SMB3_00. The check that fails in smb3_validate_negotiate is the dialect verification of the negotiate info response. This is because it tries to verify against the protocol_id in the global smbdefault_values. The protocol_id in smbdefault_values is SMB3.02. In SMB2_negotiate the protocol_id in smbdefault_values isn't updated, it is global so it probably shouldn't be, but server->dialect is. This patch changes the check in smb3_validate_negotiate to use server->dialect instead of server->vals->protocol_id. The patch works with autonegotiate and when using a specific version in the vers mount option. Signed-off-by: Daniel N Pettersson <danielnp@axis.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>	2018-01-24 19:49:07 -06:00
kbuild test robot	9084432c31	CIFS: SMBD: _smbd_get_connection() can be static Fixes: 07495ff5d9bc ("CIFS: SMBD: Establish SMB Direct connection") Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Long Li <longli@microsoft.com>	2018-01-24 19:49:07 -06:00
Long Li	8801e90233	CIFS: SMBD: Disable signing on SMB direct transport Currently the CIFS SMB Direct implementation (experimental) doesn't properly support signing. Disable it when SMB Direct is in use for transport. Signing will be enabled in future after it is implemented. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Long Li	08a3b9690f	CIFS: SMBD: Add SMB Direct debug counters For debugging and troubleshooting, export SMBDirect debug counters to /proc/fs/cifs/DebugData. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Long Li	bd3dcc6a22	CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration If I/O size is larger than rdma_readwrite_threshold, use RDMA write for SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on SMB dialect used. Append a smbd_buffer_descriptor_v1 to the end of the SMB packet and fill in other values to indicate this SMB read uses RDMA write. There is no need to read from the transport for incoming payload. At the time SMB read response comes back, the data is already transferred and placed in the pages by RDMA hardware. When SMB read is finished, deregister the memory regions if RDMA write is used for this SMB read. smbd_deregister_mr may need to do local invalidation and sleep, if server remote invalidation is not used. There are situations where the MID may not be created on I/O failure, under which memory region is deregistered when read data context is released. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Long Li	74dcf418fe	CIFS: SMBD: Read correct returned data length for RDMA write (SMB read) I/O This patch is for preparing upper layer doing SMB read via RDMA write. When RDMA write is used for SMB read, the returned data length is in DataRemaining in the response packet. Reading it properly by adding a parameter to specifiy where the returned data length is. Add the defition for memory registration to wdata and return the correct length based on if RDMA write is used. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Long Li	db223a590d	CIFS: SMBD: Upper layer performs SMB write via RDMA read through memory registration When sending I/O, if size is larger than rdma_readwrite_threshold we prepare to send SMB write packet for a RDMA read via memory registration. The actual I/O is done by remote peer through local RDMA hardware. Modify the relevant fields in the packet accordingly, and append a smbd_buffer_descriptor_v1 to the end of the SMB write packet. On write I/O finish, deregister the memory region if this was for a RDMA read. If remote invalidation is not used, the call to smbd_deregister_mr will do local invalidation and possibly wait. Memory region is normally deregistered in MID callback as soon as it's used. There are situations where the MID may not be created on I/O failure, under which memory region is deregistered when write data context is released. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:07 -06:00
Long Li	c739858334	CIFS: SMBD: Implement RDMA memory registration Memory registration is used for transferring payload via RDMA read or write. After I/O is done, memory registrations are recovered and reused. This process can be time consuming and is done in a work queue. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:06 -06:00
Long Li	9762c2d080	CIFS: SMBD: Upper layer sends data via RDMA send With SMB Direct connected, use it for sending data via RDMA send. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	d649e1bba3	CIFS: SMBD: Implement function to send data via RDMA send The transport doesn't maintain send buffers or send queue for transferring payload via RDMA send. There is no data copy in the transport on send. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	2fef137a2e	CIFS: SMBD: Upper layer receives data via RDMA receive With SMB Direct connected, use it for receiving data via RDMA receive. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	f64b78fd18	CIFS: SMBD: Implement function to receive data via RDMA receive On the receive path, the transport maintains receive buffers and a reassembly queue for transferring payload via RDMA recv. There is data copy in the transport on recv when it copies the payload to upper layer. The transport recognizes the RFC1002 header length use in the SMB upper layer payloads in CIFS. Because this length is mainly used for TCP and not applicable to RDMA, it is handled as a out-of-band information and is never sent over the wire, and the trasnport behaves like TCP to upper layer by processing and exposing the length correctly on data payloads. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	09902f8dc8	CIFS: SMBD: Set SMB Direct maximum read or write size for I/O When connecting over SMB Direct, the transport negotiates its maximum I/O sizes with the server and determines how to choose to do RDMA send/recv vs read/write. Expose these maximum I/O sizes to upper layer so we will get the correct sized payloads. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	bce9ce7cc0	CIFS: SMBD: Upper layer destroys SMB Direct session on shutdown or umount When upper layer wants to umount, make it call shutdown on transport when SMB Direct is used. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	8ef130f9ec	CIFS: SMBD: Implement function to destroy a SMB Direct connection Add function to tear down a SMB Direct connection. This is used by upper layer to free all SMB Direct connection and transport resources. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	781a8050f2	CIFS: SMBD: Upper layer reconnects to SMB Direct session Do a reconnect on SMB Direct when it is used as the connection. Reconnect can happen for many reasons and it's mostly the decision of SMB2 upper layer. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:06 -06:00
Long Li	ad57b8e172	CIFS: SMBD: Implement function to reconnect to a SMB Direct transport Add function to implement a reconnect to SMB Direct. This involves tearing down the current connection and establishing/negotiating a new connection. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Long Li	2f8946464b	CIFS: SMBD: Upper layer connects to SMBDirect session When "rdma" is specified in the mount option, make CIFS connect to SMB Direct. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-01-24 19:49:06 -06:00
Randy Dunlap	0933d6fa74	cifs: fix build errors for SMB_DIRECT Prevent build errors when CIFS=y and INFINIBAND=m. fs/cifs/smbdirect.o: In function `smbd_qp_async_error_upcall': smbdirect.c:(.text+0x28c): undefined reference to `ib_event_msg' fs/cifs/smbdirect.o: In function `smbd_destroy_rdma_work': smbdirect.c:(.text+0xfde): undefined reference to `ib_drain_qp' smbdirect.c:(.text+0xfea): undefined reference to `rdma_destroy_qp' smbdirect.c:(.text+0x12a0): undefined reference to `ib_free_cq' smbdirect.c:(.text+0x12ac): undefined reference to `ib_free_cq' smbdirect.c:(.text+0x12b8): undefined reference to `ib_dealloc_pd' smbdirect.c:(.text+0x12c4): undefined reference to `rdma_destroy_id' fs/cifs/smbdirect.o: In function `_smbd_get_connection': smbdirect.c:(.text+0x168c): undefined reference to `rdma_create_id' smbdirect.c:(.text+0x1713): undefined reference to `rdma_resolve_addr' smbdirect.c:(.text+0x1780): undefined reference to `rdma_resolve_route' smbdirect.c:(.text+0x17e3): undefined reference to `rdma_destroy_id' smbdirect.c:(.text+0x183d): undefined reference to `rdma_destroy_id' smbdirect.c:(.text+0x199d): undefined reference to `ib_alloc_cq' smbdirect.c:(.text+0x19d9): undefined reference to `ib_alloc_cq' smbdirect.c:(.text+0x1a89): undefined reference to `rdma_create_qp' smbdirect.c:(.text+0x1b3c): undefined reference to `rdma_connect' smbdirect.c:(.text+0x2538): undefined reference to `rdma_destroy_qp' smbdirect.c:(.text+0x2549): undefined reference to `ib_free_cq' smbdirect.c:(.text+0x255a): undefined reference to `ib_free_cq' smbdirect.c:(.text+0x2563): undefined reference to `ib_dealloc_pd' smbdirect.c:(.text+0x256c): undefined reference to `rdma_destroy_id' smbdirect.c:(.text+0x25f0): undefined reference to `__ib_alloc_pd' smbdirect.c:(.text+0x26bb): undefined reference to `rdma_disconnect' fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work': smbdirect.c:(.text+0x62): undefined reference to `rdma_disconnect' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org (moderated for non-subscribers) Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:06 -06:00
Matthew Wilcox	f04a703c3d	cifs: Fix missing put_xid in cifs_file_strict_mmap If cifs_zap_mapping() returned an error, we would return without putting the xid that we got earlier. Restructure cifs_file_strict_mmap() and cifs_file_mmap() to be more similar to each other and have a single point of return that always puts the xid. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>	2018-01-24 19:49:06 -06:00
Long Li	d8ec913b17	CIFS: SMBD: export protocol initial values For use-configurable SMB Direct protocol values, export them to /proc/fs/cifs. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:06 -06:00
Long Li	399f9539d9	CIFS: SMBD: Implement function to create a SMB Direct connection The upper layer calls this function to connect to peer through SMB Direct. Each SMB Direct connection is based on a RDMA RC Queue Pair. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:05 -06:00
Long Li	f198186aa9	CIFS: SMBD: Establish SMB Direct connection Add code to implement the core functions to establish a SMB Direct connection. 1. Establish an RDMA connection to SMB server. 2. Negotiate and setup SMB Direct protocol. 3. Implement idle connection timer and credit management. SMB Direct is enabled by setting CONFIG_CIFS_SMB_DIRECT. Add to Makefile to enable building SMB Direct. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:05 -06:00
Long Li	03bee01d62	CIFS: SMBD: Add SMB Direct protocol initial values and constants To prepare for protocol implementation, add constants and user-configurable values for the SMB Direct protocol. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber.redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:05 -06:00
Long Li	8339dd32fb	CIFS: SMBD: Add rdma mount option Add "rdma" to CIFS mount options to connect to SMB Direct. Add checks to validate this is used on SMB 3.X dialects. To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x". At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>	2018-01-24 19:49:05 -06:00
Long Li	2b6ed88037	CIFS: SMBD: Introduce kernel config option CONFIG_CIFS_SMB_DIRECT Build SMB Direct code when this option is set. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>	2018-01-24 19:49:05 -06:00
Long Li	2dabfd5bab	CIFS: SMBD: Add parameter rdata to smb2_new_read_req This patch is for preparing upper layer for doing SMB read via RDMA write. When we assemble the SMB read packet header, we need to know the I/O layout if this request is to use a RDMA write. rdata has all the information we need for memory registration. Add rdata to smb2_new_read_req. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	3cecf4865c	cifs: avoid a kmalloc in smb2_send_recv/SendReceive2 for the common case In both functions, use an array of 8 (arbitrary but should be big enough for all current uses) iov and avoid having to kmalloc the array for the common case. If 8 is too small, then fall back to the original behaviour and use kmalloc/kfree. This should not change any behaviour but should save us a tiny amount of cpu cycles. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	305428acf0	cifs: remove small_smb2_init Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	8eb7998e79	cifs: remove rfc1002 header from smb2_lease_ack Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	5dfe69a407	cifs: remove unused variable from SMB2_read Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	21ad9487ca	cifs: remove rfc1002 header from smb2_oplock_break we get from server Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	b2fb7fecc9	cifs: remove rfc1002 header from smb2_query_info_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	7c00c3a625	cifs: remove rfc1002 header from smb2_query_directory_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	2fc803efe6	cifs: remove rfc1002 header from smb2_set_info_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:05 -06:00
Ronnie Sahlberg	f5688a6d7c	cifs: remove rfc1002 header from smb2 read/write requests Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	ced93679cb	cifs: remove rfc1002 header from smb2_lock_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	1f444e4c06	cifs: remove rfc1002 header from smb2_flush_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	4f33bc3587	cifs: remove rfc1002 header from smb2_create_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	88ea5cb7d4	cifs: remove rfc1002 header from smb2_sess_setup_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	661bb943a9	cifs: remove rfc1002 header from smb2_tree_connect_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	7f7ae759fb	cifs: remove rfc1002 header from smb2_echo_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	9775468020	cifs: remove rfc1002 header from smb2_ioctl_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	afcccefdc3	cifs: remove rfc1002 header from smb2_close_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	4eecf4cfe1	cifs: remove rfc1002 header from smb2_tree_disconnect_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	45305eda6b	cifs: remove rfc1002 header from smb2_logoff_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	13cacea7bb	cifs: remove rfc1002 header from smb2_negotiate_req Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:04 -06:00
Ronnie Sahlberg	83b7739180	cifs: Add smb2_send_recv This function is similar to SendReceive2 except it does not expect a 4 byte rfc1002 length header in the first io vector. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-01-24 19:49:04 -06:00
Trond Myklebust	535cb8f319	lockd: Fix server refcounting The server shouldn't actually delete the struct nlm_host until it hits the garbage collector. In order to make that work correctly with the refcount API, we can bump the refcount by one, and then use refcount_dec_if_one() in the garbage collector. Signed-off-by: Trond Myklebust <trondmy@gmail.com> Acked-by: J. Bruce Fields <bfields@fieldses.org>	2018-01-24 17:33:57 -05:00
Josef Bacik	e4fd493c05	Btrfs: fix stale entries in readdir In fixing the readdir+pagefault deadlock I accidentally introduced a stale entry regression in readdir. If we get close to full for the temporary buffer, and then skip a few delayed deletions, and then try to add another entry that won't fit, we will emit the entries we found and retry. Unfortunately we delete entries from our del_list as we find them, assuming we won't need them. However our pos will be with whatever our last entry was, which could be before the delayed deletions we skipped, so the next search will add the deleted entries back into our readdir buffer. So instead don't delete entries we find in our del_list so we can make sure we always find our delayed deletions. This is a slight perf hit for readdir with lots of pending deletions, but hopefully this isn't a common occurrence. If it is we can revist this and optimize it. cc: stable@vger.kernel.org Fixes: `23b5ec7494` ("btrfs: fix readdir deadlock with pagefault") Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-24 20:27:48 +01:00
Amir Goldstein	8383f17488	ovl: wire up NFS export operations Now that NFS export operations are implemented, enable overlayfs NFS export support if the "nfs_export" feature is enabled. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:06 +01:00
Amir Goldstein	0617015403	ovl: lookup indexed ancestor of lower dir ovl_lookup_real() in lower layer walks back lower parents to find the topmost indexed parent. If an indexed ancestor is found before reaching lower layer root, ovl_lookup_real() is called recursively with upper layer to walk back from indexed upper to the topmost connected/hashed upper parent (or up to root). ovl_lookup_real() in upper layer then walks forward to connect the topmost upper overlay dir dentry and ovl_lookup_real() in lower layer continues to walk forward to connect the decoded lower overlay dir dentry. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:05 +01:00
Amir Goldstein	4b91c30a5a	ovl: lookup connected ancestor of dir in inode cache Decoding a dir file handle requires walking backward up to layer root and for lower dir also checking the index to see if any of the parents have been copied up. Lookup overlay ancestor dentry in inode/dentry cache by decoded real parents to shortcut looking up all the way back to layer root. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:05 +01:00
Amir Goldstein	7a9dadef96	ovl: hash non-indexed dir by upper inode for NFS export Non-indexed upper dirs are encoded as upper file handles. When NFS export is enabled, hash non-indexed directory inodes by upper inode, so we can find them in inode cache using the decoded upper inode. When NFS export is disabled, directories are not indexed on copy up, so hash non-indexed directory inodes by origin inode, the same hash key that is used before copy up. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:04 +01:00
Amir Goldstein	988925164f	ovl: decode pure lower dir file handles Similar to decoding a pure upper dir file handle, decoding a pure lower dir file handle is implemented by looking an overlay dentry of the same path as the pure lower path and verifying that the overlay dentry's real lower matches the decoded real lower file handle. Unlike the case of upper dir file handle, the lookup of overlay path by lower real path can fail or find a mismatched overlay dentry if any of the lower parents have been copied up and renamed. To address this case we will need to check if any of the lower parents are indexed. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:04 +01:00
Amir Goldstein	3b0bfc6ed3	ovl: decode indexed dir file handles Decoding an indexed dir file handle is done by looking up the file handle in index dir by name and then decoding the upper dir from the index origin file handle. The decoded upper path is used to lookup an overlay dentry of the same path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:03 +01:00
Amir Goldstein	9436a1a339	ovl: decode lower file handles of unlinked but open files Lookup overlay inode in cache by origin inode, so we can decode a file handle of an open file even if the index has a whiteout index entry to mark this overlay inode was unlinked. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:03 +01:00
Amir Goldstein	f71bd9cfb6	ovl: decode indexed non-dir file handles Decoding an indexed non-dir file handle is similar to decoding a lower non-dir file handle, but additionally, we lookup the file handle in index dir by name to find the real upper inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:03 +01:00
Amir Goldstein	f941866fc4	ovl: decode lower non-dir file handles Decoding a lower non-dir file handle is done by decoding the lower dentry from underlying lower fs, finding or allocating an overlay inode that is hashed by the real lower inode and instantiating an overlay dentry with that inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:02 +01:00
Amir Goldstein	03e1c584ff	ovl: encode lower file handles For indexed or lower non-dir, encode a non-connectable lower file handle from origin inode. For indexed or lower dir, when ofs->numlower == 1, encode a lower file handle from lower dir. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:02 +01:00
Amir Goldstein	05e1f11816	ovl: copy up before encoding non-connectable dir file handle Decoding a merge dir, whose origin's parent is under a redirected lower dir is not always possible. As a simple aproximation, we do not encode lower dir file handles when overlay has multiple lower layers and origin is below the topmost lower layer. We should later relax this condition and copy up only the parent that is under a redirected lower. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:01 +01:00
Amir Goldstein	b305e8443f	ovl: encode non-indexed upper file handles We only need to encode origin if there is a chance that the same object was encoded pre copy up and then we need to stay consistent with the same encoding also after copy up. In case a non-pure upper is not indexed, then it was copied up before NFS export support was enabled. In that case, we don't need to worry about staying consistent with pre copy up encoding and we encode an upper file handle. This mitigates the problem that with no index, we cannot find an upper inode from origin inode, so we cannot decode a non-indexed upper from origin file handle. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:01 +01:00
Amir Goldstein	3985b70a3e	ovl: decode connected upper dir file handles Until this change, we decoded upper file handles by instantiating an overlay dentry from the real upper dentry. This is sufficient to handle pure upper files, but insufficient to handle merge/impure dirs. To that end, if decoded real upper dir is connected and hashed, we lookup an overlay dentry with the same path as the real upper dir. If decoded real upper is non-dir, we instantiate a disconnected overlay dentry as before this change. Because ovl_fh_to_dentry() returns a connected overlay dir dentry, exportfs never needs to call get_parent() and get_name() to reconnect an upper overlay dir. Because connectable non-dir file handles are not supported, exportfs will not be able to use fh_to_parent() and get_name() methods to reconnect a disconnected non-dir to its parent. Therefore, the methods get_parent() and get_name() are implemented just to print out a sanity warning and the method fh_to_parent() is implemented to warn the user that using the 'subtree_check' exportfs option is not supported. An alternative approach could have been to implement instantiating of an overlay directory inode from origin/index and implement get_parent() and get_name() by calling into underlying fs operations and them instantiating the overlay parent dir. The reasons for not choosing the get_parent() approach were: - Obtaining a disconnected overlay dir dentry would requires a delicate re-factoring of ovl_lookup() to get a dentry with overlay parent info. It was preferred to avoid doing that re-factoring unless it was proven worthy. - Going down the path of disconnected dir would mean that the (non trivial) code path of d_splice_alias() could be traveled and that meant writing more tests and introduces race cases that are very hard to hit on purpose. Taking the path of connecting overlay dentry by forward lookup is therefore the safe and boring way to avoid surprises. The culprits of the chosen "connected overlay dentry" approach: - We need to take special care to rename of ancestors while connecting the overlay dentry by real dentry path. These subtleties are usually handled by generic exportfs and VFS code. - In a hypothetical workload, we could end up in a loop trying to connect, interrupted by rename and restarting connect forever. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:00 +01:00
Amir Goldstein	8556a4205b	ovl: decode pure upper file handles Decoding an upper file handle is done by decoding the upper dentry from underlying upper fs, finding or allocating an overlay inode that is hashed by the real upper inode and instantiating an overlay dentry with that inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:26:00 +01:00
Amir Goldstein	8ed5eec9d6	ovl: encode pure upper file handles Encode overlay file handles as struct ovl_fh containing the file handle encoding of the real upper inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:59 +01:00
Miklos Szeredi	f9c34674bc	vfs: factor out helpers d_instantiate_anon() and d_alloc_anon() Those helpers are going to be used by overlayfs to implement NFS export decode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:59 +01:00
Amir Goldstein	c62520a83b	ovl: store 'has_upper' and 'opaque' as bit flags We need to make some room in struct ovl_entry to store information about redirected ancestors for NFS export, so cram two booleans as bit flags. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:58 +01:00
Amir Goldstein	aa3ff3c152	ovl: copy up of disconnected dentries With NFS export, some operations on decoded file handles (e.g. open, link, setattr, xattr_set) may call copy up with a disconnected non-dir. In this case, we will copy up lower inode to index dir without linking it to upper dir. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:58 +01:00
Amir Goldstein	829c28be9b	ovl: use d_splice_alias() in place of d_add() in lookup This is required for NFS export. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:57 +01:00
Amir Goldstein	0aceb53e73	ovl: do not pass overlay dentry to ovl_get_inode() This is needed for using ovl_get_inode() for decoding file handles for NFS export. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:57 +01:00
Amir Goldstein	91ffe7beb3	ovl: factor out ovl_get_index_fh() helper The helper is needed to lookup an index by file handle for NFS export. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:56 +01:00
Amir Goldstein	24f0b17203	ovl: whiteout orphan index entries on mount Orphan index entries are non-dir index entries whose union nlink count dropped to zero. With index=on, orphan index entries are removed on mount. With NFS export feature enabled, orphan index entries are replaced with white out index entries to block future open by handle from opening the lower file. When dir index has a stale 'upper' xattr, we assume that the upper dir was removed and we treat the dir index as orphan entry that needs to be whited out or removed. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:56 +01:00
Amir Goldstein	e7dd0e7134	ovl: whiteout index when union nlink drops to zero With NFS export feature enabled, when overlay inode nlink drops to zero, instead of removing the index entry, replace it with a whiteout index entry. This is needed for NFS export in order to prevent future open by handle from opening the lower file directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:56 +01:00
Amir Goldstein	89a17556ce	ovl: cleanup dir index when dir nlink drops to zero When non-dir index union nlink drops to zero the non-dir index is cleaned. Do the same for directory type index entries when union directory is removed. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:55 +01:00
Amir Goldstein	016b720f55	ovl: index directories on copy up for NFS export With the NFS export feature enabled, all dirs are indexed on copy up. Non-dir files are copied up directly to indexdir and then hardlinked to upper dir. Directories are copied up to indexdir, then an index entry is created in indexdir with 'upper' xattr pointing to the copied up dir and then the copied up dir is moved to upper dir. Directory index is also used for consistency verification, like detecting multiple redirected dirs to the same lower dir on lookup. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:55 +01:00
Amir Goldstein	fbd2d2074b	ovl: index all non-dir on copy up for NFS export With the NFS export feature enabled, all non-dir are indexed on copy up. The copy up origin inode of an indexed non-dir can be used as a unique identifier of the overlay object. The full index is also used for consistency verfication, like detecting multiple non-hardlink uppers with the same 'origin' on lookup. Directory index on copy up will be implemented by following patch. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:54 +01:00
Amir Goldstein	24b33ee104	ovl: create ovl_need_index() helper The helper determines which lower file needs to be indexed on copy up and before nlink changes. For index=on, the helper evaluates to true for lower hardlinks. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:54 +01:00
Amir Goldstein	9ee60ce249	ovl: cleanup temp index entries A previous failed attempt to create or whiteout a directory index may leave index entries named '#%x' in the index dir. Cleanup those temp entries on mount instead of failing the mount. In the future, we may drop 'work' dir and use 'index' dir instead. This change is enough for cleaning up copy up leftovers 'from the future', but it is not enough for cleaning up rmdir leftovers 'from the future' (i.e. temp dir containing whiteouts). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:53 +01:00
Amir Goldstein	e8f9e5b780	ovl: verify directory index entries on mount Directory index entries should have 'upper' xattr pointing to the real upper dir. Verifying that the upper dir file handle is not stale is expensive, so only verify stale directory index entries on mount if NFS export feature is enabled. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:53 +01:00
Amir Goldstein	7db25d36d9	ovl: verify whiteout index entries on mount Whiteout index entries are used as an indication that an exported overlay file handle should be treated as stale (i.e. after unlink of the overlay inode). Check on mount that whiteout index entries have a name that looks like a valid file handle and cleanup invalid index entries. For whiteout index entries, do not check that they also have valid origin fh and nlink xattr, because those xattr do not exist for a whiteout index entry. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:53 +01:00
Amir Goldstein	ad1d615cec	ovl: use directory index entries for consistency verification A directory index is a directory type entry in index dir with a "trusted.overlay.upper" xattr containing an encoded ovl_fh of the merge directory upper dir inode. On lookup of non-dir files, lower file is followed by origin file handle. On lookup of dir entries, lower dir is found by name and then compared to origin file handle. We only trust dir index if we verified that lower dir matches origin file handle, otherwise index may be inconsistent and we ignore it. If we find an indexed non-upper dir or an indexed merged dir, whose index 'upper' xattr points to a different upper dir, that means that the lower directory may be also referenced by another upper dir via redirect, so we fail the lookup on inconsistency error. To be consistent with directory index entries format, the association of index dir to upper root dir, that was stored by older kernels in "trusted.overlay.origin" xattr is now stored in "trusted.overlay.upper" xattr. This also serves as an indication that overlay was mounted with a kernel that support index directory entries. For backward compatibility, if an 'origin' xattr exists on the index dir we also verify it on mount. Directory index entries are going to be used for NFS export. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:52 +01:00
Amir Goldstein	86eaa13046	ovl: unbless lower st_ino of unverified origin On a malformed overlay, several redirected dirs can point to the same dir on a lower layer. This presents a similar challenge as broken hardlinks, because different objects in the overlay can return the same st_ino/st_dev pair from stat(2). For broken hardlinks, we do not provide constant st_ino on copy up to avoid this inconsistency. When NFS export feature is enabled, apply the same logic to files and directories with unverified lower origin. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:52 +01:00
Amir Goldstein	37b12916c0	ovl: verify stored origin fh matches lower dir When the NFS export feature is enabled, overlayfs implicitly enables the feature "verify_lower". When the "verify_lower" feature is enabled, a directory inode found in lower layer by name or by redirect_dir is verified against the file handle of the copy up origin that is stored in the upper layer. This introduces a change of behavior for the case of lower layer modification while overlay is offline. A lower directory created or moved offline under an exisitng upper directory, will not be merged with that upper directory. The NFS export feature should not be used after copying layers, because the new lower directory inodes would fail verification and won't be merged with upper directories. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:51 +01:00
Amir Goldstein	f168f1098d	ovl: add support for "nfs_export" configuration Introduce the "nfs_export" config, module and mount options. The NFS export feature depends on the "index" feature and enables two implicit overlayfs features: "index_all" and "verify_lower". The "index_all" feature creates an index on copy up of every file and directory. The "verify_lower" feature uses the full index to detect overlay filesystems inconsistencies on lookup, like redirect from multiple upper dirs to the same lower dir. NFS export can be enabled for non-upper mount with no index. However, because lower layer redirects cannot be verified with the index, enabling NFS export support on an overlay with no upper layer requires turning off redirect follow (e.g. "redirect_dir=nofollow"). The full index may incur some overhead on mount time, especially when verifying that lower directory file handles are not stale. NFS export support, full index and consistency verification will be implemented by following patches. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:37 +01:00
Amir Goldstein	60b866420b	ovl: update documentation of inodes index feature Document that inode index feature solves breaking hard links on copy up. Simplify Kconfig backward compatibility disclaimer. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:20:02 +01:00
Amir Goldstein	051224438a	ovl: generalize ovl_verify_origin() and helpers Remove the "origin" language from the functions that handle set, get and verify of "origin" xattr and pass the xattr name as an argument. The same helpers are going to be used for NFS export to get, get and verify the "upper" xattr for directory index entries. ovl_verify_origin() is now a helper used only to verify non upper file handle stored in "origin" xattr of upper inode. The upper root dir file handle is still stored in "origin" xattr on the index dir for backward compatibility. This is going to be changed by the patch that adds directory index entries support. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:54 +01:00
Amir Goldstein	1eff1a1dee	ovl: simplify arguments to ovl_check_origin_fh() Pass the fs instance with lower_layers array instead of the dentry lowerstack array to ovl_check_origin_fh(), because the dentry members of lowerstack play no role in this helper. This change simplifies the argument list of ovl_check_origin(), ovl_cleanup_index() and ovl_verify_index(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:46 +01:00
Amir Goldstein	2e1a532883	ovl: factor out ovl_check_origin_fh() Re-factor ovl_check_origin() and ovl_get_origin(), so origin fh xattr is read from upper inode only once during lookup with multiple lower layers and only once when verifying index entry origin. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:35 +01:00
Amir Goldstein	d583ed7d13	ovl: store layer index in ovl_layer Store the fs root layer index inside ovl_layer struct, so we can get the root fs layer index from merge dir lower layer instead of find it with ovl_find_layer() helper. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:25 +01:00
Amir Goldstein	972d0093c2	ovl: force r/o mount when index dir creation fails When work dir creation fails, a warning is emitted and overlay is mounted r/o. Trying to remount r/w will fail with no work dir. When index dir creation fails, the same warning is emitted and overlay is mounted r/o, but trying to remount r/w will succeed. This may cause unintentional corruption of filesystem consistency. Adjust the behavior of index dir creation failure to that of work dir creation failure and do not allow to remount r/w. User needs to state an explicitly intention to work without an index by mounting with option 'index=off' to allow r/w mount with no index dir. When mounting with option 'index=on' and no 'upperdir', index is implicitly disabled, so do not warn about no file handle support. The issue was introduced with inodes index feature in v4.13, but this patch will not apply cleanly before ovl_fill_super() re-factoring in v4.15. Fixes: `02bcd15774` ("ovl: introduce the inodes index dir feature") Cc: <stable@vger.kernel.org> #v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:14 +01:00
Amir Goldstein	a683737ba9	ovl: disable index when no xattr support Overlayfs falls back to index=off if lower/upper fs does not support file handles. Do the same if upper fs does not support xattr. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:19:07 +01:00
Amir Goldstein	9678e63030	ovl: fix inconsistent d_ino for legacy merge dir For a merge dir that was copied up before v4.12 or that was hand crafted offline (e.g. mkdir {upper/lower}/dir), upper dir does not contain the 'trusted.overlay.origin' xattr. In that case, stat(2) on the merge dir returns the lower dir st_ino, but getdents(2) returns the upper dir d_ino. After this change, on merge dir lookup, missing origin xattr on upper dir will be fixed and 'impure' xattr will be fixed on parent of the legacy merge dir. Suggested-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 10:18:19 +01:00
David S. Miller	5ca114400d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in 'net'. The esp{4,6}_offload.c conflicts were overlapping changes. The 'out' label is removed so we just return ERR_PTR(-EINVAL) directly. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-23 13:51:56 -05:00
Bob Peterson	805c090750	GFS2: Log the reason for log flushes in every log header This patch just adds the capability for GFS2 to track which function called gfs2_log_flush. This should make it easier to diagnose problems based on the sequence of events found in the journals. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2018-01-23 07:39:20 -07:00
Bob Peterson	c1696fb85d	GFS2: Introduce new gfs2_log_header_v2 This patch adds a new structure called gfs2_log_header_v2 which is used to store expanded fields into previously unused areas of the log headers (i.e., this change is backwards compatible). Some of these are used for debug purposes so we can backtrack when problems occur. Others are reserved for future expansion. This patch is based on a prototype from Steve Whitehouse. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2018-01-23 07:38:53 -07:00
Greg Kroah-Hartman	78fae52cf4	sysfs: remove DEBUG defines It isn't needed at all in these files, dynamic debug is the best way to enable this type of thing, if you really want it. As it is, these defines were not doing anything at all. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-23 10:19:23 +01:00
Greg Kroah-Hartman	619daeeeb8	sysfs: use SPDX identifiers Move the license "mark" of the sysfs files to be in SPDX form, instead of the custom text that it currently is in. This is in a quest to get rid of the 700+ different ways we say "GPLv2" in the kernel tree. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-23 10:19:10 +01:00
Ben Hutchings	1995266727	nfsd: auth: Fix gid sorting when rootsquash enabled Commit `bdcf0a423e` ("kernel: make groups_sort calling a responsibility group_info allocators") appears to break nfsd rootsquash in a pretty major way. It adds a call to groups_sort() inside the loop that copies/squashes gids, which means the valid gids are sorted along with the following garbage. The net result is that the highest numbered valid gids are replaced with any lower-valued garbage gids, possibly including 0. We should sort only once, after filling in all the gids. Fixes: `bdcf0a423e` ("kernel: make groups_sort calling a responsibility ...") Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-22 20:13:07 -08:00
Jaegeuk Kim	f236792311	f2fs: allow to recover node blocks given updated checkpoint If fsck.f2fs changes crc, we have no way to recover some inode blocks by roll- forward recovery. Let's relax the condition to recover them. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:59 -08:00
Jaegeuk Kim	37a086f015	f2fs: recover some i_inline flags This fixes lost i_inline flags during roll-forward. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:58 -08:00
Daeho Jeong	b2c4692bc2	f2fs: correct removexattr behavior for null valued extended attribute __vfs_removexattr() transfers "NULL" value to the setxattr handler of the f2fs filesystem in order to remove the extended attribute. But, __f2fs_setxattr() just ignores the removal request when the value of the extended attribute is already NULL. We have to remove the extended attribute itself even if the value of that is already NULL. We can reporduce this bug with the below: 1. touch file 2. setfattr -n "user.foo" file 3. setfattr -x "user.foo" file 4. getfattr -d file > user.foo Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com> Tested-by: Hobin Woo <hobin.woo@samsung.com> Tested-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:57 -08:00
Chao Yu	db198ae0f8	f2fs: drop page cache after fs shutdown Don't remain dirtied page cache in f2fs after shutdown, it can mitigate memory pressure of whole system, in order to keep other modules working properly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:56 -08:00
Chao Yu	7950e9ac63	f2fs: stop gc/discard thread after fs shutdown Once filesystem shuts down, daemons like gc/discard thread should be aware of it, and do exit, in addtion, drop all cached pending discard commands and turn off real-time discard mode. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:55 -08:00
Chao Yu	d027c48447	f2fs: hanlde error case in f2fs_ioc_shutdown This patch makes f2fs_ioc_shutdown handling error case correctly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:54 -08:00
Chao Yu	bb9e3bb8db	f2fs: split need_inplace_update This patch splits need_inplace_update to two functions: a. should_update_inplace() includes all conditions that we must use IPU. b. should_update_outplace() includes all conditions that we must use OPU. So that, in f2fs_ioc_set_pin_file() and f2fs_defragment_range(), we can use corresponding function to check whether we can trigger OPU/IPU or not. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:53 -08:00
Chao Yu	eb4497975e	f2fs: fix to update last_disk_size correctly This patch fixes to update last_disk_size only when writing out page successfully. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:52 -08:00
Chao Yu	b323fd28bb	f2fs: kill F2FS_INLINE_XATTR_ADDRS for cleanup Use get_inline_xattr_addrs directly instead of F2FS_INLINE_XATTR_ADDRS. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:51 -08:00
Chao Yu	d7997e6368	f2fs: clean up error path of fill_super This patch cleans up error path of fille_super to avoid unneeded release step. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:50 -08:00
Sheng Yong	a9d572c755	f2fs: avoid hungtask when GC encrypted block if io_bits is set When io_bits is set, GCing encrypted block may hit the following hungtask. Since io_bits requires aligned block address, f2fs_submit_page_write may return -EAGAIN if new_blkaddr does not satisify io_bits alignment. As a result, the encrypted page will never be writtenback. This patch makes move_data_block aware the EAGAIN error and cancel the writeback. [ 246.751371] INFO: task kworker/u4:4:797 blocked for more than 90 seconds. [ 246.752423] Not tainted 4.15.0-rc4+ #11 [ 246.754176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 246.755336] kworker/u4:4 D25448 797 2 0x80000000 [ 246.755597] Workqueue: writeback wb_workfn (flush-7:0) [ 246.755616] Call Trace: [ 246.755695] ? __schedule+0x322/0xa90 [ 246.755761] ? blk_init_request_from_bio+0x120/0x120 [ 246.755773] ? pci_mmcfg_check_reserved+0xb0/0xb0 [ 246.755801] ? __radix_tree_create+0x19e/0x200 [ 246.755813] ? delete_node+0x136/0x370 [ 246.755838] schedule+0x43/0xc0 [ 246.755904] io_schedule+0x17/0x40 [ 246.755939] wait_on_page_bit_common+0x17b/0x240 [ 246.755950] ? wake_page_function+0xa0/0xa0 [ 246.755961] ? add_to_page_cache_lru+0x160/0x160 [ 246.755972] ? page_cache_tree_insert+0x170/0x170 [ 246.755983] ? __lru_cache_add+0x96/0xb0 [ 246.756086] __filemap_fdatawait_range+0x14f/0x1c0 [ 246.756097] ? wait_on_page_bit_common+0x240/0x240 [ 246.756120] ? __wake_up_locked_key_bookmark+0x20/0x20 [ 246.756167] ? wait_on_all_pages_writeback+0xc9/0x100 [ 246.756179] ? __remove_ino_entry+0x120/0x120 [ 246.756192] ? wait_woken+0x100/0x100 [ 246.756204] filemap_fdatawait_range+0x9/0x20 [ 246.756216] write_checkpoint+0x18a1/0x1f00 [ 246.756254] ? blk_get_request+0x10/0x10 [ 246.756265] ? cpumask_next_and+0x43/0x60 [ 246.756279] ? f2fs_sync_inode_meta+0x160/0x160 [ 246.756289] ? remove_element.isra.4+0xa0/0xa0 [ 246.756300] ? __put_compound_page+0x40/0x40 [ 246.756310] ? f2fs_sync_fs+0xec/0x1c0 [ 246.756320] ? f2fs_sync_fs+0x120/0x1c0 [ 246.756329] f2fs_sync_fs+0x120/0x1c0 [ 246.756357] ? trace_event_raw_event_f2fs__page+0x260/0x260 [ 246.756393] ? ata_build_rw_tf+0x173/0x410 [ 246.756397] f2fs_balance_fs_bg+0x198/0x390 [ 246.756405] ? drop_inmem_page+0x230/0x230 [ 246.756415] ? ahci_qc_prep+0x1bb/0x2e0 [ 246.756418] ? ahci_qc_issue+0x1df/0x290 [ 246.756422] ? __accumulate_pelt_segments+0x42/0xd0 [ 246.756426] ? f2fs_write_node_pages+0xd1/0x380 [ 246.756429] f2fs_write_node_pages+0xd1/0x380 [ 246.756437] ? sync_node_pages+0x8f0/0x8f0 [ 246.756440] ? update_curr+0x53/0x220 [ 246.756444] ? __accumulate_pelt_segments+0xa2/0xd0 [ 246.756448] ? __update_load_avg_se.isra.39+0x349/0x360 [ 246.756452] ? do_writepages+0x2a/0xa0 [ 246.756456] do_writepages+0x2a/0xa0 [ 246.756460] __writeback_single_inode+0x70/0x490 [ 246.756463] ? check_preempt_wakeup+0x199/0x310 [ 246.756467] writeback_sb_inodes+0x2a2/0x660 [ 246.756471] ? is_empty_dir_inode+0x40/0x40 [ 246.756474] ? __writeback_single_inode+0x490/0x490 [ 246.756477] ? string+0xbf/0xf0 [ 246.756480] ? down_read_trylock+0x35/0x60 [ 246.756484] __writeback_inodes_wb+0x9f/0xf0 [ 246.756488] wb_writeback+0x41d/0x4b0 [ 246.756492] ? writeback_inodes_wb.constprop.55+0x150/0x150 [ 246.756498] ? set_worker_desc+0xf7/0x130 [ 246.756502] ? current_is_workqueue_rescuer+0x60/0x60 [ 246.756511] ? _find_next_bit+0x2c/0xa0 [ 246.756514] ? wb_workfn+0x400/0x5d0 [ 246.756518] wb_workfn+0x400/0x5d0 [ 246.756521] ? finish_task_switch+0xdf/0x2a0 [ 246.756525] ? inode_wait_for_writeback+0x30/0x30 [ 246.756529] process_one_work+0x3a7/0x6f0 [ 246.756533] worker_thread+0x82/0x750 [ 246.756537] kthread+0x16f/0x1c0 [ 246.756541] ? trace_event_raw_event_workqueue_work+0x110/0x110 [ 246.756544] ? kthread_create_worker_on_cpu+0xb0/0xb0 [ 246.756548] ret_from_fork+0x1f/0x30 Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:49 -08:00
Jaegeuk Kim	d8a9a22992	f2fs: allow quota to use reserved blocks This patch allows quota to use reserved blocks all the time. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:48 -08:00
Chao Yu	a2e2e76b23	f2fs: fix to drop all inmem pages correctly In commit `57864ae5ce` ("f2fs: limit # of inmemory pages"), we have limited memory footprint of all inmem pages with 20% of total memory, otherwise, if we exceed the threshold, we will try to drop all inmem pages to avoid excessive memory pressure resulting in performance regression. But in some unrelated error paths, we will also drop all inmem pages, which should be wrong, fix it in this patch. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:47 -08:00
Chao Yu	f3d98e74fc	f2fs: speed up defragment on sparse file We have supported to get next page offset with valid mapping crossing hole in f2fs_map_blocks, utilizing it to speed up defragment on sparse file. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:46 -08:00
Chao Yu	c4020b2da4	f2fs: support F2FS_IOC_PRECACHE_EXTENTS This patch introduces a new ioctl F2FS_IOC_PRECACHE_EXTENTS to precache extent info like ext4, in order to gain better performance during triggering AIO by eliminating synchronous waiting of mapping info. Referred commit: `7869a4a6c5` ("ext4: add support for extent pre-caching") In addition, with newly added extent precache abilitiy, this patch add to support FIEMAP_FLAG_CACHE in ->fiemap. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:45 -08:00
Jaegeuk Kim	1ad71a2712	f2fs: add an ioctl to disable GC for specific file This patch gives a flag to disable GC on given file, which would be useful, when user wants to keep its block map. It also conducts in-place-update for dontmove file. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-01-22 14:56:35 -08:00
Martin Brandenburg	a0ec1ded22	orangefs: initialize op on loop restart in orangefs_devreq_read In orangefs_devreq_read, there is a loop which picks an op off the list of pending ops. If the loop fails to find an op, there is nothing to read, and it returns EAGAIN. If the op has been given up on, the loop is restarted via a goto. The bug is that the variable which the found op is written to is not reinitialized, so if there are no more eligible ops on the list, the code runs again on the already handled op. This is triggered by interrupting a process while the op is being copied to the client-core. It's a fairly small window, but it's there. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-22 13:51:14 -08:00
Martin Brandenburg	0afc0decf2	orangefs: use list_for_each_entry_safe in purge_waiting_ops set_op_state_purged can delete the op. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-22 13:51:14 -08:00
Anand Jain	f2788d2f76	btrfs: set the total_devices in device_list_add() There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:56 +01:00
Anand Jain	327f18cc7f	btrfs: move pr_info into device_list_add Commit `60999ca4b4` ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:54 +01:00
Anand Jain	d8367db30a	btrfs: make btrfs_free_stale_devices() to match the path The btrfs_free_stale_devices() is updated to match for the given device path and delete it. (It searches for only unmounted list of devices.) Also drop the comment about different path being used for the same device, since now we will have cli to clean any device that's not a concern any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:52 +01:00
Anand Jain	0d34097f66	btrfs: rename btrfs_free_stale_devices() arg to skip_dev No functional changes. Rename btrfs_free_stale_devices() arg to skip_dev, so that it reflects what that arg for. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:50 +01:00
Anand Jain	522f1b45e4	btrfs: make btrfs_free_stale_devices() argument optional This updates btrfs_free_stale_devices() helper function to delete all unmouted devices, when arg is NULL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:48 +01:00
Anand Jain	38cf665d33	btrfs: make btrfs_free_stale_device() to iterate all stales Let the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Also rename btrfs_free_stale_device() to btrfs_free_stale_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:47 +01:00
Anand Jain	a848b3e547	btrfs: no need to check for btrfs_fs_devices::seeding There is no need to check for btrfs_fs_devices::seeding when we have checked for btrfs_fs_devices::opened, because we can't sprout without its seed FS being opened. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 20:25:44 +01:00
Greg Kroah-Hartman	5d54f948aa	sysfs: turn WARN() into pr_warn() It's not good to crash the machine if panic_on_warn() is set just because someone made a stupid mistake of trying to create a sysfs file with the same name of an existing one. This makes the automated testing tools a lot harder to find the real bugs in the kernel. So just print a warning out and dump the stack to get the attention of the developer that they did something foolish. Then keep on trucking, as this should not be a fatal error at all. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-22 16:11:12 +01:00
Nikolay Borisov	b03ebd992f	btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it No functional changes, just makes the code more readable Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	5f4791f4a6	Btrfs: noinline merge_extent_mapping In order to debug subtle bugs around merge_extent_mapping(), perf probe can be used to check the arguments, but sometimes merge_extent_mapping() got inlined by compiler and couldn't be probed. This is adding noinline attribute to merge_extent_mapping(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	9a7e10e7ba	Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping This is a subtle case, so in order to understand the problem, it'd be good to know the content of existing and em when any error occurs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	cd77f4f836	Btrfs: extent map selftest: dio write vs dio read This test case simulates the racy situation of dio write vs dio read, and see if btrfs_get_extent() would return -EEXIST. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	fd87526fad	Btrfs: extent map selftest: buffered write vs dio read This test case simulates the racy situation of buffered write vs dio read, and see if btrfs_get_extent() would return -EEXIST. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	72b28077a2	Btrfs: add extent map selftests We've observed that btrfs_get_extent() and merge_extent_mapping() could return -EEXIST in several cases, and they are caused by some racy condition, e.g dio read vs dio write, which makes the problem very tricky to reproduce. This adds extent map selftests in order to simulate those racy situations. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> [ minor string adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	c04e61b5e4	Btrfs: move extent map specific code to extent_map.c These helpers are extent map specific, move them to extent_map.c. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:22 +01:00
Liu Bo	7b4df058b0	Btrfs: add helper for em merge logic This is a prepare work for the following extent map selftest, which runs tests against em merge logic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	18e83ac75b	Btrfs: fix unexpected EEXIST from btrfs_get_extent This fixes a corner case that is caused by a race of dio write vs dio read/write. Here is how the race could happen. Suppose that no extent map has been loaded into memory yet. There is a file extent [0, 32K), two jobs are running concurrently against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio read from [0, 4K) or [4K, 8K). t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K). ------------------------------------------------------ t1 t2 btrfs_get_blocks_direct() btrfs_get_blocks_direct() -> btrfs_get_extent() -> btrfs_get_extent() -> lookup_extent_mapping() -> add_extent_mapping() -> lookup_extent_mapping() # load [0, 32K) -> btrfs_new_extent_direct() -> btrfs_drop_extent_cache() # split [0, 32K) and # drop [8K, 32K) -> add_extent_mapping() # add [8K, 32K) -> add_extent_mapping() # handle -EEXIST when adding # [0, 32K) ------------------------------------------------------ About how t2(dio read/write) runs into -EEXIST: a) add_extent_mapping() gets -EEXIST for adding em [0, 32k), b) search_extent_mapping() then returns [0, 8k) as the existing em, even though start == existing->start, em is [0, 32k) so that extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k, c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree, d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write, which is confusing applications. Here I conclude all the possible situations, 1) start < existing->start +-----------+em+-----------+ +--prev---+ \| +-------------+ \| \| \| \| \| \| \| +---------+ + +---+existing++ ++ + \| + start 2) start == existing->start +------------em------------+ \| +-------------+ \| \| \| \| \| + +----existing-+ + \| \| + start 3) start > existing->start && start < (existing->start + existing->len) +------------em------------+ \| +-------------+ \| \| \| \| \| + +----existing-+ + \| \| + start 4) start >= (existing->start + existing->len) +-----------+em+-----------+ \| +-------------+ \| +--next---+ \| \| \| \| \| \| + +---+existing++ + +---------+ + \| + start As we can see, it turns out that if start is within existing em (front inclusive), then the existing em should be returned as is, otherwise, we try our best to merge candidate em with sibling ems to form a larger em (in order to reduce the total number of em). Reported-by: David Vallender <david.vallender@landmark.co.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	a520a7e0b5	Btrfs: fix incorrect block_len in merge_extent_mapping %block_len could be checked on deciding if two em are mergeable. merge_extent_mapping() has only added the front pad if the front part of em gets truncated, but it's possible that the end part gets truncated. For both compressed extent and inline extent, em->block_len is not adjusted accordingly, and for regular extent, em->block_len always equals to em->len, hence this sets em->block_len with em->len. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Matthew Wilcox	3cbf26da5e	btrfs: Remove unused readahead spinlock The reada_lock in struct btrfs_device was only initialised, and not actually used. That's good because there's another lock also called reada_lock in the btrfs_fs_info that was quite heavily used. Remove this one. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	7583d8d088	Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io Before rbio_orig_end_io() goes to free rbio, rbio may get merged with more bios from other rbios and rbio->bio_list becomes non-empty, in that case, these newly merged bios don't end properly. Once unlock_stripe() is done, rbio->bio_list will not be updated any more and we can call bio_endio() on all queued bios. It should only happen in error-out cases, the normal path of recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable merge before doing IO, so rbio_orig_end_io() called by them doesn't have the above issue. Reported-by: Jérôme Carretero <cJ-ko@zougloub.eu> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	44ac474def	Btrfs: do not cache rbio pages if using raid6 recover Since raid6 recover tries all possible combinations of failed stripes, - when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and raid6_2data_recov(), it may change the in-memory content of failed stripes, if such a raid bio is cached, a later raid write rmw or recover can steal @stripe_pages from it instead of reading from disks, such that it carries the wrong content to do write rmw or recovery and ends up with corruption or recovery failures. - when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached because the only failed stripe which contains @rbio->bio_pages gets modified, others remain the same so that their in-memory content is consistent with their on-disk content. This adds a check to skip caching rbio if using raid6 recover. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	0198e5b707	Btrfs: raid56: iterate raid56 internal bio with bio_for_each_segment_all Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it will never be a BIO_CLONED bio, and since this is called by end_io functions, bio->bi_iter.bi_size is zero, we mustn't use bio_for_each_segment() as that is a no-op if bi_size is zero. Fixes: `6592e58c6b` ("Btrfs: fix write corruption due to bio cloning on raid5/6") Cc: <stable@vger.kernel.org> # v4.12-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Su Yue	df6703e15c	btrfs: correct wrong comment about magic number of index_cnt There is no function named btrfs_get_inode_index_count. Explanation for magic number index_cnt=2 in btrfs_new_inode() is actually located in btrfs_set_inode_index_count(). So replace 'btrfs_get_inode_index_count' in the comment by 'btrfs_set_inode_index_count'. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Nikolay Borisov	d2560ebd23	btrfs: Make btrfs_inode_rsv_release static It's not used outside of extent-tree so there is no reason to not be static. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Anand Jain	1c94da9dd9	btrfs: cleanup btrfs_free_stale_device() usage We call btrfs_free_stale_device() only when we alloc a new struct btrfs_device (ret=1), so move it closer to where we alloc the new device. Also drop the comments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
David Sterba	e2683fc9d2	btrfs: tree-check: reduce stack consumption in check_dir_item I've noticed that the updated item checker stack consumption increased dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker for dir item") tree-checker.c:check_leaf +552 (176 -> 728) The array is 255 bytes long, dynamic allocation would slow down the sanity checks so it's more reasonable to keep it on-stack. Moving the variable to the scope of use reduces the stack usage again tree-checker.c:check_leaf -264 (728 -> 464) Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Xiongfeng Wang	6670d4c2d9	btrfs: use correct string length in DEV_INFO ioctl gcc-8 reports: fs/btrfs/ioctl.c: In function 'btrfs_ioctl': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified bound 1024 equals destination size [-Wstringop-truncation] We need one less byte or call strlcpy() to make it a nul-terminated string. This is done on the next line anyway, but we want to avoid the warning. Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Anand Jain	6f794e3c5c	btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP It appears from the original commit [1] that there isn't any design specific reason not to fail the mount instead of just warning. This patch will change it to fail. [1] commit `319e4d0661` btrfs: Enhance super validation check Fixes: `319e4d0661` ("btrfs: Enhance super validation check") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Anand Jain	e2731e5588	btrfs: define SUPER_FLAG_METADUMP_V2 btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34). So just define that in kernel so that we know its been used. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:21 +01:00
Liu Bo	a6f93c71d4	Btrfs: avoid losing data raid profile when deleting a device We've avoided data losing raid profile when doing balance, but it turns out that deleting a device could also result in the same problem. Say we have 3 disks, and they're created with '-d raid1' profile. - We have chunk P (the only data chunk on the empty btrfs). - Suppose that chunk P's two raid1 copies reside in disk A and disk B. - Now, 'btrfs device remove disk B' btrfs_rm_device() -> btrfs_shrink_device() -> btrfs_relocate_chunk() #relocate any chunk on disk B to other places. - Chunk P will be removed and a new chunk will be created to hold those data, but as chunk P is the only one holding raid1 profile, after it goes away, the new chunk will be created as single profile which is our default profile. This fixes the problem by creating an empty data chunk before relocating the data chunk. Metadata/System chunk are supposed to have non-zero bytes all the time so their raid profile is preserved. Reported-by: James Alandt <James.Alandt@wdc.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Filipe Manana	81fdf6382b	Btrfs: fix space leak after fallocate and zero range operations If we do a buffered write after a zero range operation that has an unaligned (with the filesystem's sector size) end which also falls within an unwritten (prealloc) extent that is currently beyond the inode's i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE, we end up leaking data and metadata space. This happens because when zeroing a range we call btrfs_truncate_block(), which does delalloc (loads the page and partially zeroes its content), and in the buffered write path we only clear existing delalloc space reservation for the range we are writing into if that range starts at an offset smaller then the inode's i_size, which makes sense since we can not have delalloc extents beyond the i_size, only unwritten extents are allowed. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "falloc -k 428K 4K" /mnt/foobar $ xfs_io -c "fzero -k 0 430K" /mnt/foobar $ xfs_io -c "pwrite -S 0xaa 428K 4K" /mnt/foobar $ umount /mnt After the unmount we get the metadata and data space leaks reported in dmesg/syslog: [95794.602253] ------------[ cut here ]------------ [95794.603322] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9561 btrfs_destroy_inode+0x4e/0x206 [btrfs] [95794.605167] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.613000] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.614448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.615972] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.617114] RIP: 0010:btrfs_destroy_inode+0x4e/0x206 [btrfs] [95794.618001] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202 [95794.618721] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.619645] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.620711] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.621932] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.623124] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.624188] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.625578] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.626522] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.627647] Call Trace: [95794.628128] destroy_inode+0x3d/0x55 [95794.628573] evict+0x177/0x17e [95794.629010] dispose_list+0x50/0x71 [95794.629478] evict_inodes+0x132/0x141 [95794.630289] generic_shutdown_super+0x3f/0x10b [95794.630864] kill_anon_super+0x12/0x1c [95794.631383] btrfs_kill_super+0x16/0x21 [btrfs] [95794.631930] deactivate_locked_super+0x30/0x68 [95794.632539] deactivate_super+0x36/0x39 [95794.633200] cleanup_mnt+0x49/0x67 [95794.633818] __cleanup_mnt+0x12/0x14 [95794.634416] task_work_run+0x82/0xa6 [95794.634902] prepare_exit_to_usermode+0xe1/0x10c [95794.635525] syscall_return_slowpath+0x18c/0x1af [95794.636122] entry_SYSCALL_64_fastpath+0xab/0xad [95794.636834] RIP: 0033:0x7fa678cb99a7 [95794.637370] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.638672] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.639596] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.640703] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.641773] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.643150] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.644249] Code: ff 4c 8b a8 80 06 00 00 48 8b 87 c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 <0f> ff 83 bb 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 [95794.646929] ---[ end trace e95877675c6ec007 ]--- [95794.647751] ------------[ cut here ]------------ [95794.648509] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9562 btrfs_destroy_inode+0x59/0x206 [btrfs] [95794.649842] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.654659] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.655894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.657546] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.658433] RIP: 0010:btrfs_destroy_inode+0x59/0x206 [btrfs] [95794.659279] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202 [95794.660054] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.660753] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.661513] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.662289] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.663393] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.664342] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.665673] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.666593] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.667629] Call Trace: [95794.668065] destroy_inode+0x3d/0x55 [95794.668637] evict+0x177/0x17e [95794.669179] dispose_list+0x50/0x71 [95794.669830] evict_inodes+0x132/0x141 [95794.670416] generic_shutdown_super+0x3f/0x10b [95794.671103] kill_anon_super+0x12/0x1c [95794.671786] btrfs_kill_super+0x16/0x21 [btrfs] [95794.672552] deactivate_locked_super+0x30/0x68 [95794.673393] deactivate_super+0x36/0x39 [95794.674107] cleanup_mnt+0x49/0x67 [95794.674706] __cleanup_mnt+0x12/0x14 [95794.675279] task_work_run+0x82/0xa6 [95794.675795] prepare_exit_to_usermode+0xe1/0x10c [95794.676507] syscall_return_slowpath+0x18c/0x1af [95794.677275] entry_SYSCALL_64_fastpath+0xab/0xad [95794.678006] RIP: 0033:0x7fa678cb99a7 [95794.678600] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.679739] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.680779] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.681837] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.682867] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.683891] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.684843] Code: c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 0f ff 83 bb 40 ff ff ff 00 74 02 <0f> ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff [95794.687156] ---[ end trace e95877675c6ec008 ]--- [95794.687876] ------------[ cut here ]------------ [95794.688579] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9565 btrfs_destroy_inode+0x7d/0x206 [btrfs] [95794.689735] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.695015] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.696396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.697956] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.698925] RIP: 0010:btrfs_destroy_inode+0x7d/0x206 [btrfs] [95794.699763] RSP: 0018:ffffc90001737d00 EFLAGS: 00010206 [95794.700434] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.701445] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.702448] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.703557] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.704441] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.705270] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.706341] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.707001] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.708030] Call Trace: [95794.708466] destroy_inode+0x3d/0x55 [95794.709071] evict+0x177/0x17e [95794.709497] dispose_list+0x50/0x71 [95794.709973] evict_inodes+0x132/0x141 [95794.710564] generic_shutdown_super+0x3f/0x10b [95794.711200] kill_anon_super+0x12/0x1c [95794.711633] btrfs_kill_super+0x16/0x21 [btrfs] [95794.712139] deactivate_locked_super+0x30/0x68 [95794.712608] deactivate_super+0x36/0x39 [95794.713093] cleanup_mnt+0x49/0x67 [95794.713514] __cleanup_mnt+0x12/0x14 [95794.713933] task_work_run+0x82/0xa6 [95794.714543] prepare_exit_to_usermode+0xe1/0x10c [95794.715247] syscall_return_slowpath+0x18c/0x1af [95794.715952] entry_SYSCALL_64_fastpath+0xab/0xad [95794.716653] RIP: 0033:0x7fa678cb99a7 [95794.721100] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.722052] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.722856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.723698] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.724736] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.725928] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.726728] Code: 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff 00 74 02 0f ff 48 83 bb 30 ff ff ff 00 74 02 <0f> ff 48 83 bb 08 ff ff ff 00 74 02 0f ff 4d 85 e4 0f 84 52 01 [95794.729203] ---[ end trace e95877675c6ec009 ]--- [95794.841054] ------------[ cut here ]------------ [95794.841829] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5831 btrfs_free_block_groups+0x235/0x36a [btrfs] [95794.843425] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.850658] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.852590] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.854752] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.855812] RIP: 0010:btrfs_free_block_groups+0x235/0x36a [btrfs] [95794.856811] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.857805] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.859014] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.860270] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.861525] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000 [95794.862700] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100 [95794.863810] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.865149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.866099] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.867198] Call Trace: [95794.867626] close_ctree+0x1db/0x2b8 [btrfs] [95794.868188] ? evict_inodes+0x132/0x141 [95794.869037] btrfs_put_super+0x15/0x17 [btrfs] [95794.870400] generic_shutdown_super+0x6a/0x10b [95794.871262] kill_anon_super+0x12/0x1c [95794.872046] btrfs_kill_super+0x16/0x21 [btrfs] [95794.872746] deactivate_locked_super+0x30/0x68 [95794.873687] deactivate_super+0x36/0x39 [95794.874639] cleanup_mnt+0x49/0x67 [95794.875504] __cleanup_mnt+0x12/0x14 [95794.876126] task_work_run+0x82/0xa6 [95794.876788] prepare_exit_to_usermode+0xe1/0x10c [95794.877777] syscall_return_slowpath+0x18c/0x1af [95794.878381] entry_SYSCALL_64_fastpath+0xab/0xad [95794.878888] RIP: 0033:0x7fa678cb99a7 [95794.879307] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.880204] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.881640] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.882690] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.883538] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.884562] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.885664] Code: 89 ef e8 07 ec 32 e1 e8 9d c0 ea e0 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 <0f> ff 48 83 bb 88 02 00 00 00 74 02 0f ff 48 83 bb d8 02 00 00 [95794.887980] ---[ end trace e95877675c6ec00a ]--- [95794.888739] ------------[ cut here ]------------ [95794.889405] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5832 btrfs_free_block_groups+0x241/0x36a [btrfs] [95794.891020] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.897551] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.898509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.899685] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.900592] RIP: 0010:btrfs_free_block_groups+0x241/0x36a [btrfs] [95794.901387] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.902300] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.903260] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.904332] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.905300] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000 [95794.906439] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100 [95794.907459] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.908625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.909511] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.910630] Call Trace: [95794.911153] close_ctree+0x1db/0x2b8 [btrfs] [95794.911837] ? evict_inodes+0x132/0x141 [95794.912344] btrfs_put_super+0x15/0x17 [btrfs] [95794.912975] generic_shutdown_super+0x6a/0x10b [95794.913788] kill_anon_super+0x12/0x1c [95794.914424] btrfs_kill_super+0x16/0x21 [btrfs] [95794.915142] deactivate_locked_super+0x30/0x68 [95794.915831] deactivate_super+0x36/0x39 [95794.916433] cleanup_mnt+0x49/0x67 [95794.917045] __cleanup_mnt+0x12/0x14 [95794.917665] task_work_run+0x82/0xa6 [95794.918309] prepare_exit_to_usermode+0xe1/0x10c [95794.919021] syscall_return_slowpath+0x18c/0x1af [95794.919722] entry_SYSCALL_64_fastpath+0xab/0xad [95794.920426] RIP: 0033:0x7fa678cb99a7 [95794.921039] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.922303] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.923335] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.924364] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.925435] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.926533] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.927557] Code: 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 0f ff 48 83 bb 88 02 00 00 00 74 02 <0f> ff 48 83 bb d8 02 00 00 00 74 02 0f ff 48 83 bb e0 02 00 00 [95794.930166] ---[ end trace e95877675c6ec00b ]--- [95794.930961] ------------[ cut here ]------------ [95794.931727] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.932729] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.938394] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.939842] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.941455] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.942336] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.943268] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.944127] RAX: ffff8802004fd0e8 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.945211] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.946316] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.947271] R10: ffffc90001737c80 R11: 00000000000337fd R12: ffff8802004fd0e8 [95794.948219] R13: ffff88006145c0c0 R14: ffff88006145e598 R15: ffff88006145c100 [95794.949193] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.950495] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.951338] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.952361] Call Trace: [95794.952811] close_ctree+0x1db/0x2b8 [btrfs] [95794.953522] ? evict_inodes+0x132/0x141 [95794.954543] btrfs_put_super+0x15/0x17 [btrfs] [95794.955231] generic_shutdown_super+0x6a/0x10b [95794.955916] kill_anon_super+0x12/0x1c [95794.956414] btrfs_kill_super+0x16/0x21 [btrfs] [95794.956953] deactivate_locked_super+0x30/0x68 [95794.957635] deactivate_super+0x36/0x39 [95794.958256] cleanup_mnt+0x49/0x67 [95794.958701] __cleanup_mnt+0x12/0x14 [95794.959181] task_work_run+0x82/0xa6 [95794.959635] prepare_exit_to_usermode+0xe1/0x10c [95794.960182] syscall_return_slowpath+0x18c/0x1af [95794.960731] entry_SYSCALL_64_fastpath+0xab/0xad [95794.961438] RIP: 0033:0x7fa678cb99a7 [95794.961990] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.963111] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.963975] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.964680] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.965763] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.966868] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.967800] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff [95794.970629] ---[ end trace e95877675c6ec00c ]--- [95794.971451] BTRFS info (device sdi): space_info 1 has 7680000 free, is not full [95794.972351] BTRFS info (device sdi): space_info total=8388608, used=704512, pinned=0, reserved=0, may_use=4096, readonly=0 [95794.973595] ------------[ cut here ]------------ [95794.974353] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.980163] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.986461] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.987591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.988929] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.989922] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.990715] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.991431] RAX: ffff88020f6e70e8 RBX: ffff88006145c000 RCX: ffffffff8115a906 [95794.992455] RDX: ffffffff8115a902 RSI: ffff880075aa0b40 RDI: ffff880075aa0b40 [95794.993535] RBP: ffffc90001737d98 R08: 0000000000000020 R09: fffffffffffffff7 [95794.994573] R10: 00000000ffffffc4 R11: ffff8800633b1bc0 R12: ffff88020f6e70e8 [95794.996250] R13: 0000000000000038 R14: ffff88006145e598 R15: 0000000000000000 [95794.997233] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.998592] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.999484] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95795.000542] Call Trace: [95795.001138] close_ctree+0x1db/0x2b8 [btrfs] [95795.001885] ? evict_inodes+0x132/0x141 [95795.002407] btrfs_put_super+0x15/0x17 [btrfs] [95795.003093] generic_shutdown_super+0x6a/0x10b [95795.003720] kill_anon_super+0x12/0x1c [95795.004353] btrfs_kill_super+0x16/0x21 [btrfs] [95795.005095] deactivate_locked_super+0x30/0x68 [95795.005716] deactivate_super+0x36/0x39 [95795.006388] cleanup_mnt+0x49/0x67 [95795.006939] __cleanup_mnt+0x12/0x14 [95795.007512] task_work_run+0x82/0xa6 [95795.008124] prepare_exit_to_usermode+0xe1/0x10c [95795.008994] syscall_return_slowpath+0x18c/0x1af [95795.009831] entry_SYSCALL_64_fastpath+0xab/0xad [95795.010610] RIP: 0033:0x7fa678cb99a7 [95795.011193] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95795.012327] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95795.013432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95795.014558] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95795.015577] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95795.016569] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95795.017662] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff [95795.020538] ---[ end trace e95877675c6ec00d ]--- [95795.021259] BTRFS info (device sdi): space_info 4 has 1072775168 free, is not full [95795.022390] BTRFS info (device sdi): space_info total=1073741824, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536 Fix this by ensuring the zero range operation does not call btrfs_truncate_block() if the corresponding extent is an unwritten one (it's pointless anyway, since reading from an unwritten extent yields zeroes). Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Filipe Manana	9f13ce743b	Btrfs: fix missing inode i_size update after zero range operation For a fallocate's zero range operation that targets a range with an end that is not aligned to the sector size, we can end up not updating the inode's i_size. This happens when the last page of the range maps to an unwritten (prealloc) extent and before that last page we have either a hole or a written extent. This is because in this scenario we relied on a call to btrfs_prealloc_file_range() to update the inode's i_size, however it can only update the i_size to the "down aligned" end of the range. Example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar $ xfs_io -c "falloc -k 428K 4K" /mnt/foobar $ xfs_io -c "fzero 0 430K" /mnt/foobar $ du --bytes /mnt/foobar 438272 /mnt/foobar The inode's i_size was left as 428Kb (438272 bytes) when it should have been updated to 430Kb (440320 bytes). Fix this by always updating the inode's i_size explicitly after zeroing the range. Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Filipe Manana	94f450712a	Btrfs: use cached state when dirtying pages during buffered write During a buffered IO write, we can have an extent state that we got when we locked the range (if the range starts at an offset lower than eof), so always pass it to btrfs_dirty_pages() so that setting the delalloc bit in the range does not need to do a full search in the inode's io tree, saving time and reducing the amount of time we hold the io tree's lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Filipe Manana	f27451f229	Btrfs: add support for fallocate's zero range operation This implements support the zero range operation of fallocate. For now at least it's as simple as possible while reusing most of the existing fallocate and hole punching infrastructure. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Liu Bo	cc54ff626a	Btrfs: do not merge rbios if their fail stripe index are not identical Since fail stripe index in rbio would be used to decide which algorithm reconstruction would be run, we cannot merge rbios if their's fail striped indexes are different, otherwise, one of the two reconstructions would fail. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Liu Bo	db34be19c4	Btrfs: remove redundant check in rbio_can_merge Given the above ' if (last->operation != cur->operation) return 0; ', it's guaranteed that two operations are same. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	05a5c55dfc	btrfs: minor style cleanups in btrfs_scan_one_device Assign ret = -EINVAL where it is actually required. Remove { } around single line if else code. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	c1f32b7c1f	btrfs: simplify mutex unlocking code in btrfs_commit_transaction No functional change rearrange the mutex_unlock. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ edit subject ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	cadbc0a067	btrfs: rename btrfs_device::scrub_device to scrub_ctx btrfs_device::scrub_device is not a device which is being scrubbed, but it holds the scrub context, so rename to reflect the same. No functional changes here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	922ea8994a	btrfS: collapse btrfs_handle_error() into __btrfs_handle_fs_error() There is no other consumer for btrfs_handle_error() other than __btrfs_handle_fs_error(), further this function quite small. Merge it into its parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	61ecda6865	btrfs: remove check for BTRFS_FS_STATE_ERROR which we just set __btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR is set in btrfs_handle_error(). And there is no other user of btrfs_handle_error() as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Liu Bo	8810f7517a	Btrfs: make raid6 rebuild retry more There is a scenario that can end up with rebuild process failing to return good content, i.e. suppose that all disks can be read without problems and if the content that was read out doesn't match its checksum, currently for raid6 btrfs at most retries twice, - the 1st retry is to rebuild with all other stripes, it'll eventually be a raid5 xor rebuild, - if the 1st fails, the 2nd retry will deliberately fail parity p so that it will do raid6 style rebuild, however, the chances are that another non-parity stripe content also has something corrupted, so that the above retries are not able to return correct content, and users will think of this as data loss. More seriouly, if the loss happens on some important internal btree roots, it could refuse to mount. This extends btrfs to do more retries and each retry fails only one stripe. Since raid6 can tolerate 2 disk failures, if there is one more failure besides the failure on which we're recovering, this can always work. The worst case is to retry as many times as the number of raid6 disks, but given the fact that such a scenario is really rare in practice, it's still acceptable. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Liu Bo	762221f095	Btrfs: fix scrub to repair raid6 corruption The raid6 corruption is that, suppose that all disks can be read without problems and if the content that was read out doesn't match its checksum, currently for raid6 btrfs at most retries twice, - the 1st retry is to rebuild with all other stripes, it'll eventually be a raid5 xor rebuild, - if the 1st fails, the 2nd retry will deliberately fail parity p so that it will do raid6 style rebuild, however, the chances are that another non-parity stripe content also has something corrupted, so that the above retries are not able to return correct content. We've fixed normal reads to rebuild raid6 correctly with more retries in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix scrub to do the exactly same rebuild process. [1]: https://patchwork.kernel.org/patch/10091755/ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
Anand Jain	6528b99d3d	btrfs: factor btrfs_check_rw_degradable() to check given device Update btrfs_check_rw_degradable() to check against the given device if its lost. We can use this function to know if the volume is going to be in degraded mode OR failed state, when the given device fails. Which is needed when we are handling the device failed state. A preparatory patch does not affect the flow as such. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ enhance comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:20 +01:00
David Sterba	e43bbe5e16	btrfs: sink unlock_extent parameter gfp_flags All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
David Sterba	d810a4be1a	btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC There's only one instance where we pass different gfp mask to unlock_extent_cached. Add a separate helper for that and then we can drop the gfp parameter from unlock_extent_cached. Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
David Sterba	5bedc48a8f	btrfs: drop unused parameters from mount_subvol Recent patches reworking the mount path left some unused parameters. We pass a vfsmount to mount_subvol, the flags and data (ie. mount options) have been already applied and we will not need them. Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Misono, Tomohiro	e215772cd2	btrfs: cleanup unnecessary string dup in btrfs_parse_options() Long ago, commit `edf24abe51` ("btrfs: sanity mount option parsing and early mount code") split the btrfs_parse_options() into two parts (btrfs_parse_early_options() and btrfs_parse_options()). As a result, btrfs_parse_optins no longer gets called twice and is the last one to parse mount option string. Therefore there is no need to dup it. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Liu Bo	203e02d934	Btrfs: remove unused wait in btrfs_stripe_hash In fact nobody is waiting on @wait's waitqueue, it can be safely removed. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Nikolay Borisov	36f7894f66	btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio The bio is not referenced after it has been submitted and the endio is going to consume the sole reference on successful submission. On error, the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't leak it either. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Nikolay Borisov	ffc9c8dd7d	btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio The bio is never referenced after it has been submitted so there is no point in getting an extra reference. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Nikolay Borisov	ea057f6daf	btrfs: Remove redundant bio_get/set from submit_dio_repair_bio The bio that is passsed is the newly created repair bio which already has a reference count of 1, which is going to be consumed by the endio routine on successful submission. On error the handler also calls bio_put. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Nikolay Borisov	32506af595	btrfs: Remove redundant bio_get/set calls in compressed read/write paths bio_get/set is necessary only if the bio is going to be referenced following submissions. In the code paths where such calls are made we don't really need them since the bio is referenced only if btrfs_map_bio returns an error. And this function can return an error prior to submission only. So referencing the bio is safe. Furthermore we do call bio_endio which will consume the last reference. So let's remove the redundant calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Nikolay Borisov	4271ecea64	btrfs: Improve btrfs_search_slot description Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
David Sterba	36243c9199	btrfs: heuristic: call get4bits directly As it's a single instance and local to the file, we don't need to pass it as an argument. Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
David Sterba	7add17befc	btrfs: heuristic: open code copy_call callback of radix sort The callback is trivial and we don't need the abstraction for our purposes. Let's open code it. Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
David Sterba	23ae8c63aa	btrfs: heuristic: open code get_num callback of radix sort The callback is trivial and we don't need the abstraction for our purposes. Let's open code it and also make the array types explicit. Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Misono, Tomohiro	78f6beacd0	btrfs: remove unused arg from parse_subvol_options() Remove unused arg 'holder' from parse_subvol_options(), which has been forgotten to be cleaned in the commit b99beb110e2d ("btrfs: split parse_early_options() in two"). Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Misono, Tomohiro	83085935cc	btrfs: remove unused setup_root_args() Since setup_root_args() is not used anymore, just remove it. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:19 +01:00
Misono, Tomohiro	d740760656	btrfs: split parse_early_options() in two Now parse_early_options() is used by both btrfs_mount() and btrfs_mount_root(). However, the former only needs subvol related part and the latter needs the others. Therefore extract the subvol related parts from parse_early_options() and move it to new parse function (parse_subvol_options()). Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:18 +01:00
Misono, Tomohiro	312c89fbca	btrfs: cleanup btrfs_mount() using btrfs_mount_root() Cleanup btrfs_mount() by using btrfs_mount_root(). This avoids getting btrfs_mount() called twice in mount path. Old btrfs_mount() will do: 0. VFS layer calls vfs_kern_mount() with registered file_system_type (for btrfs, btrfs_fs_type). btrfs_mount() is called on the way. 1. btrfs_parse_early_options() parses "subvolid=" mount option and set the value to subvol_objectid. Otherwise, subvol_objectid has the initial value of 0 2. check subvol_objectid is 5 or not. Assume this time id is not 5, then btrfs_mount() returns by calling mount_subvol() 3. In mount_subvol(), original mount options are modified to contain "subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with btrfs_fs_type and new options 4. btrfs_mount() is called again 5. btrfs_parse_early_options() parses "subvolid=0" and set 5 (instead of 0) to subvol_objectid 6. check subvol_objectid is 5 or not. This time id is 5 and mount_subvol() is not called. btrfs_mount() finishes mounting a root 7. (in mount_subvol()) with using a return vale of vfs_kern_mount(), it calls mount_subtree() 8. return subvolume's dentry Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount() is the cause of complication. Instead, new btrfs_mount() will do: 1. parse subvol id related options for later use in mount_subvol() 2. mount device's root by calling vfs_kern_mount() with btrfs_root_fs_type, which is not registered to VFS by register_filesystem(). As a result, btrfs_mount_root() is called 3. return by calling mount_subvol() The code of 2. is moved from the first part of mount_subvol(). The semantics of device holder changes from btrfs_fs_type to btrfs_root_fs_type and has to be used in all contexts. Otherwise we'd get wrong results when mount and dev scan would not check the same thing. (this has been found indendently and the fix is folded into this patch) Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ fold the btrfs_control_ioctl fixup, extend the comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:18 +01:00
Misono, Tomohiro	72fa39f5c7	btrfs: add btrfs_mount_root() and new file_system_type Add btrfs_mount_root() and new file_system_type for preparation of cleanup of btrfs_mount(). Code path is not changed yet. btrfs_mount_root() is almost the same as current btrfs_mount(), but doesn't have subvolume related part. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:18 +01:00

... 3 4 5 6 7 ...

52613 Commits