Commit Graph

64998 Commits

Author SHA1 Message Date
Linus Torvalds
07c8f3bfef \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y2McACgkQnJ2qBz9k
 QNlHzwf/e4oz9oRCXPqBwh6C318nl6ksQO5ooW+Dhb535cr/Cn99nuZa3GrvW+aq
 eSbypsvZQMguk0/okEc4jcTgLmEw+KubpBXOi/DJZ9dzGQrvjT2nBkQmaTqwp9dO
 WMZcJLmszkrtokjKD4lVjyQArcwqQF/v/moEKIImw5A6CY4R4odTaUOCPnTwF7P6
 OXsDPwRfAccJ25ZUZ8hjc+fRl/Ncex6szciaJ08T4btlaAtc5UIn5Sy/u8BqNNiw
 0VRheD4sJ2c25hLOIQJ5RETIeuYaRcR/BA3vm+k1d2iIiw4ubj9+ppwiaWOryA9U
 5fXnBmXKuUUrwFihzmiLSckIpm3IPg==
 =kghV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "Several smaller fixes and cleanups for fsnotify subsystem"

* tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: fix ignore mask logic for events on child and on dir
  fanotify: don't write with size under sizeof(response)
  fsnotify: Remove proc_fs.h include
  fanotify: remove reference to fill_event_metadata()
  fsnotify: add mutex destroy
  fanotify: prefix should_merge()
  fanotify: Replace zero-length array with flexible-array
  inotify: Fix error return code assignment flow.
  fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()
2020-06-04 13:51:54 -07:00
Linus Torvalds
d77d1dbba9 zonefs changes for 5.8
Only one patch in this pull request to cleanup handling of uuid using
 the import_uuid() helper, from Andy.
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXtg3xAAKCRDdoc3SxdoY
 dp+AAQCIAQpe4qyF5hJtwLPY+qffDDuHDxHjrERpA6c7fpKicgD+K6uDIwZ8Y6L8
 XXYPmKer58rV61jX4hvZGCAYwLmzRwA=
 =Vc+W
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs update from Damien Le Moal:
 "Only one patch in this pull request to cleanup handling of uuid using
  the import_uuid() helper, from Andy"

* tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Replace uuid_copy() with import_uuid()
2020-06-04 13:50:13 -07:00
Steve French
331cc667a9 cifs: update internal module version number
To 2.27

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
2f58967979 cifs: multichannel: try to rebind when reconnecting a channel
first steps in trying to make channels properly reconnect.

* add cifs_ses_find_chan() function to find the enclosing cifs_chan
  struct it belongs to
* while we have the session lock and are redoing negprot and
  sess.setup in smb2_reconnect() redo the binding of channels.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
8eec79540d cifs: multichannel: use pointer for binding channel
Add a cifs_chan pointer in struct cifs_ses that points to the channel
currently being bound if ses->binding is true.

Previously it was always the channel past the established count.

This will make reconnecting (and rebinding) a channel easier later on.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Steve French
edb1613536 smb3: remove static checker warning
Remove static checker warning pointed out by Dan Carpenter:

The patch feeaec621c09: "cifs: multichannel: move channel selection
above transport layer" from Apr 24, 2020, leads to the following
static checker warning:

        fs/cifs/smb2pdu.c:149 smb2_hdr_assemble()
        error: we previously assumed 'tcon->ses' could be null (see line 133)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Aurelien Aptel <aptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
352d96f3ac cifs: multichannel: move channel selection above transport layer
Move the channel (TCP_Server_Info*) selection from the tranport
layer to higher in the call stack so that:

- credit handling is done with the server that will actually be used
  to send.
  * ->wait_mtu_credit
  * ->set_credits / set_credits
  * ->add_credits / add_credits
  * add_credits_and_wake_if

- potential reconnection (smb2_reconnect) done when initializing a
  request is checked and done with the server that will actually be
  used to send.

To do this:

- remove the cifs_pick_channel() call out of compound_send_recv()

- select channel and pass it down by adding a cifs_pick_channel(ses)
  call in:
  - smb311_posix_mkdir
  - SMB2_open
  - SMB2_ioctl
  - __SMB2_close
  - query_info
  - SMB2_change_notify
  - SMB2_flush
  - smb2_async_readv  (if none provided in context param)
  - SMB2_read         (if none provided in context param)
  - smb2_async_writev (if none provided in context param)
  - SMB2_write        (if none provided in context param)
  - SMB2_query_directory
  - send_set_info
  - SMB2_oplock_break
  - SMB311_posix_qfs_info
  - SMB2_QFS_info
  - SMB2_QFS_attr
  - smb2_lockv
  - SMB2_lease_break
    - smb2_compound_op
  - smb2_set_ea
  - smb2_ioctl_query_info
  - smb2_query_dir_first
  - smb2_query_info_comound
  - smb2_query_symlink
  - cifs_writepages
  - cifs_write_from_iter
  - cifs_send_async_read
  - cifs_read
  - cifs_readpages

- add TCP_Server_Info *server param argument to:
  - cifs_send_recv
  - compound_send_recv
  - SMB2_open_init
  - SMB2_query_info_init
  - SMB2_set_info_init
  - SMB2_close_init
  - SMB2_ioctl_init
  - smb2_iotcl_req_init
  - SMB2_query_directory_init
  - SMB2_notify_init
  - SMB2_flush_init
  - build_qfs_info_req
  - smb2_hdr_assemble
  - smb2_reconnect
  - fill_small_buf
  - smb2_plain_req_init
  - __smb2_plain_req_init

The read/write codepath is different than the rest as it is using
pages, io iterators and async calls. To deal with those we add a
server pointer in the cifs_writedata/cifs_readdata/cifs_io_parms
context struct and set it in:

- cifs_writepages      (wdata)
- cifs_write_from_iter (wdata)
- cifs_readpages       (rdata)
- cifs_send_async_read (rdata)

The [rw]data->server pointer is eventually copied to
cifs_io_parms->server to pass it down to SMB2_read/SMB2_write.
If SMB2_read/SMB2_write is called from a different place that doesn't
set the server field it will pick a channel.

Some places do not pick a channel and just use ses->server or
cifs_ses_server(ses). All cifs_ses_server(ses) calls are in codepaths
involving negprot/sess.setup.

- SMB2_negotiate         (binding channel)
- SMB2_sess_alloc_buffer (binding channel)
- SMB2_echo              (uses provided one)
- SMB2_logoff            (uses master)
- SMB2_tdis              (uses master)

(list not exhaustive)

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
7c06514afd cifs: multichannel: always zero struct cifs_io_parms
SMB2_read/SMB2_write check and use cifs_io_parms->server, which might
be uninitialized memory.

This change makes all callers zero-initialize the struct.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Kenneth D'souza
8e84a61a9c cifs: dump Security Type info in DebugData
Currently the end user is unaware with what sec type the
cifs share is mounted if no sec=<type> option is parsed.
With this patch one can easily check from DebugData.

Example:
1) Name: x.x.x.x Uses: 1 Capability: 0x8001f3fc	Session Status: 1 Security type: RawNTLMSSP

Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
2020-06-04 13:50:38 -05:00
Sahitya Tummala
e78790f84a f2fs: fix retry logic in f2fs_write_cache_pages()
In case a compressed file is getting overwritten, the current retry
logic doesn't include the current page to be retried now as it sets
the new start index as 0 and new end index as writeback_index - 1.
This causes the corresponding cluster to be uncompressed and written
as normal pages without compression. Fix this by allowing writeback to
be retried for the current page as well (in case of compressed page
getting retried due to index mismatch with cluster index). So that
this cluster can be written compressed in case of overwrite.

Also, align f2fs_write_cache_pages() according to the change -
<64081362e8ff>("mm/page-writeback.c: fix range_cyclic writeback vs
writepages deadlock").

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-04 11:45:09 -07:00
David Howells
8409f67b64 afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
Adjust the fileserver rotation algorithm so that if we've tried all the
addresses on a server (cumulatively over multiple operations) until we've
run out of untried addresses, immediately reprobe all that server's
interfaces and retry the op at least once before we move onto the next
server.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
32275d3f75 afs: Show more a bit more server state in /proc/net/afs/servers
Display more information about the state of a server record, including the
flags, rtt and break counter plus the probe state for each server in
/proc/net/afs/servers.

Rearrange the server flags a bit to make them easier to read at a glance in
the proc file.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
f3c130e6e6 afs: Don't use probe running state to make decisions outside probe code
Don't use the running state for fileserver probes to make decisions about
which server to use as the state is cleared at the start of a probe and
also intermediate values might be misleading.

Instead, add a separate 'latest known' rtt in the afs_server struct and a
flag to indicate if the server is known to be responding and update these
as and when we know what to change them to.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
f11a016a85 afs: Fix afs_statfs() to not let the values go below zero
Fix afs_statfs() so that the value for f_bavail and f_bfree don't go
"negative" if the number of blocks in use by a volume exceeds the max quota
for that volume.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
3c4c4075fc afs: Fix the by-UUID server tree to allow servers with the same UUID
Whilst it shouldn't happen, it is possible for multiple fileservers to
share a UUID, particularly if an entire cell has been duplicated, UUIDs and
all.  In such a case, it's not necessarily possible to map the effect of
the CB.InitCallBackState3 incoming RPC to a specific server unambiguously
by UUID and thus to a specific cell.

Indeed, there's a problem whereby multiple server records may need to
occupy the same spot in the rb_tree rooted in the afs_net struct.

Fix this by allowing servers to form a list, with the head of the list in
the tree.  When the front entry in the list is removed, the second in the
list just replaces it.  afs_init_callback_state() then just goes down the
line, poking each server in the list.

This means that some servers will be unnecessarily poked, unfortunately.
An alternative would be to route by call parameters.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
2020-06-04 15:37:57 +01:00
David Howells
20325960f8 afs: Reorganise volume and server trees to be rooted on the cell
Reorganise afs_volume objects such that they're in a tree keyed on volume
ID, rooted at on an afs_cell object rather than being in multiple trees,
each of which is rooted on an afs_server object.

afs_server structs become per-cell and acquire a pointer to the cell.

The process of breaking a callback then starts with finding the server by
its network address, following that to the cell and then looking up each
volume ID in the volume tree.

This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web
and allows those structs and the code for maintaining them to be simplified
or removed.

It does make a couple of things a bit more tricky, though:

 (1) Operations now start with a volume, not a server, so there can be more
     than one answer as to whether or not the server we'll end up using
     supports the FS.InlineBulkStatus RPC.

 (2) CB RPC operations that specify the server UUID.  There's still a tree
     of servers by UUID on the afs_net struct, but the UUIDs in it aren't
     guaranteed unique.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
cca37d45d5 afs: Add a tracepoint to track the lifetime of the afs_volume struct
Add a tracepoint to track the lifetime of the afs_volume struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
6dfdf5369c afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
YFS Volume Location servers have an operation by which the cell name may be
queried.  Use this to find out what a YFS server thinks the canonical cell
name should be.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
6ef350b184 afs: Detect cell aliases 2 - Cells with no root volumes
Implement the second phase of cell alias detection.  This part handles
alias detection for cells that don't have root.cell volumes and so we have
to find some other volume or fileserver to query.

We take the first volume from each such cell and attempt to look it up in
the new cell.  If found, we compare the records, if they are the same, we
judge the cell names to be aliases.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
8a070a9648 afs: Detect cell aliases 1 - Cells with root volumes
Put in the first phase of cell alias detection.  This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).

When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common.  If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.

Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.

The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.

This necessary because:

 (1) Whilst fileservers have UUIDs that are meant to be globally unique, in
     practice they are not because cells get cloned without changing the
     UUIDs - so afs_server records need to be per cell.

 (2) Sometimes the DNS is used to make cell aliases - but if we don't know
     they're the same, we may end up with multiple superblocks and multiple
     afs_server records for the same thing, impairing our ability to
     deliver callback notifications of third party changes

 (3) The fileserver RPC API doesn't contain the cell name, so it can't tell
     us which cell it's notifying and can't see that a change made to to
     one cell should notify the same client that's also accessed as the
     other cell.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
c3e9f88826 afs: Implement client support for the YFSVL.GetCellName RPC op
Implement client support for the YFSVL.GetCellName RPC operation by which
YFS permits the canonical cell name to be queried from a VL server.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
194d28cf19 afs: Retain more of the VLDB record for alias detection
Save more bits from the volume location database record obtained for a
server so that we can use this information in cell alias detection.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
3120c170ef afs: Fix handling of CB.ProbeUuid cache manager op
The AFS filesystem driver is handling the CB.ProbeUuid request incorrectly.
The UUID presented in the request is that of the cache manager, not the
fileserver, so afs_deliver_cb_probe_uuid() shouldn't be using that UUID to
look up the server.

Fix this by looking up the server by address instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
44746355cc afs: Don't get epoch from a server because it may be ambiguous
Don't get the epoch from a server, particularly one that we're looking up
by UUID, as UUIDs may be ambiguous and may map to more than one server - so
we can't draw any conclusions from it.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:56 +01:00
David Howells
e49c7b2f6d afs: Build an abstraction around an "operation" concept
Turn the afs_operation struct into the main way that most fileserver
operations are managed.  Various things are added to the struct, including
the following:

 (1) All the parameters and results of the relevant operations are moved
     into it, removing corresponding fields from the afs_call struct.
     afs_call gets a pointer to the op.

 (2) The target volume is made the main focus of the operation, rather than
     the target vnode(s), and a bunch of op->vnode->volume are made
     op->volume instead.

 (3) Two vnode records are defined (op->file[]) for the vnode(s) involved
     in most operations.  The vnode record (struct afs_vnode_param)
     contains:

	- The vnode pointer.

	- The fid of the vnode to be included in the parameters or that was
          returned in the reply (eg. FS.MakeDir).

	- The status and callback information that may be returned in the
     	  reply about the vnode.

	- Callback break and data version tracking for detecting
          simultaneous third-parth changes.

 (4) Pointers to dentries to be updated with new inodes.

 (5) An operations table pointer.  The table includes pointers to functions
     for issuing AFS and YFS-variant RPCs, handling the success and abort
     of an operation and handling post-I/O-lock local editing of a
     directory.

To make this work, the following function restructuring is made:

 (A) The rotation loop that issues calls to fileservers that can be found
     in each function that wants to issue an RPC (such as afs_mkdir()) is
     extracted out into common code, in a new file called fs_operation.c.

 (B) The rotation loops, such as the one in afs_mkdir(), are replaced with
     a much smaller piece of code that allocates an operation, sets the
     parameters and then calls out to the common code to do the actual
     work.

 (C) The code for handling the success and failure of an operation are
     moved into operation functions (as (5) above) and these are called
     from the core code at appropriate times.

 (D) The pseudo inode getting stuff used by the dynamic root code is moved
     over into dynroot.c.

 (E) struct afs_iget_data is absorbed into the operation struct and
     afs_iget() expects to be given an op pointer and a vnode record.

 (F) Point (E) doesn't work for the root dir of a volume, but we know the
     FID in advance (it's always vnode 1, unique 1), so a separate inode
     getter, afs_root_iget(), is provided to special-case that.

 (G) The inode status init/update functions now also take an op and a vnode
     record.

 (H) The RPC marshalling functions now, for the most part, just take an
     afs_operation struct as their only argument.  All the data they need
     is held there.  The result delivery functions write their answers
     there as well.

 (I) The call is attached to the operation and then the operation core does
     the waiting.

And then the new operation code is, for the moment, made to just initialise
the operation, get the appropriate vnode I/O locks and do the same rotation
loop as before.

This lays the foundation for the following changes in the future:

 (*) Overhauling the rotation (again).

 (*) Support for asynchronous I/O, where the fileserver rotation must be
     done asynchronously also.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:17 +01:00
Miklos Szeredi
74c6e384e9 ovl: make oip->index bool
ovl_get_inode() uses oip->index as a bool value, not as a pointer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
b778e1ee1a ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
Next patch will want to pass a modified set of flags, so...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
df820f8de4 ovl: make private mounts longterm
Overlayfs is using clone_private_mount() to create internal mounts for
underlying layers.  These are used for operations requiring a path, such as
dentry_open().

Since these private mounts are not in any namespace they are treated as
short term, "detached" mounts and mntput() involves taking the global
mount_lock, which can result in serious cacheline pingpong.

Make these private mounts longterm instead, which trade the penalty on
mntput() for a slightly longer shutdown time due to an added RCU grace
period when putting these mounts.

Introduce a new helper kern_unmount_many() that can take care of multiple
longterm mounts with a single RCU grace period.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
b8e42a651b ovl: get rid of redundant members in struct ovl_fs
ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
used instead of a separate ->upperdir_trap.

Split the lowerdir option early to get the number of layers, then allocate
the ->layers array, and finally fill the upper and lower layers, as before.

Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
take care of that.

[Colin Ian King] Fix null pointer dereference on null stack pointer on
error return found by Coverity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
08f4c7c86d ovl: add accessor for ofs->upper_mnt
Next patch will remove ofs->upper_mnt, so add an accessor function for this
field.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Yuxuan Shui
520da69d26 ovl: initialize error in ovl_copy_xattr
In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
xattrs, the copy loop will terminate without assigning anything to the
error variable, thus returning an uninitialized value.

If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
value is put into a pointer by ERR_PTR(), causing potential invalid memory
accesses down the line.

This commit initialize error with 0. This is the correct value because when
there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
should succeed.

This bug is discovered with the help of INIT_STACK_ALL and clang.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
Fixes: 0956254a2d ("ovl: don't copy up opaqueness")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Steve French
e80ddeb2f7 smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K
We were not checking to see if ioctl requests asked for more than
64K (ie when CIFSMaxBufSize was > 64K) so when setting larger
CIFSMaxBufSize then ioctls would fail with invalid parameter errors.
When requests ask for more than 64K in MaxOutputResponse then we
need to ask for more than 1 credit.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-04 01:13:41 -05:00
Steve French
1ee0e6d47d smb3: default to minimum of two channels when multichannel specified
When "multichannel" is specified on mount, make sure to default to
at least two channels.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-06-04 01:13:37 -05:00
Linus Torvalds
ee01c4d72a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
2020-06-03 20:24:15 -07:00
Jan Kara
6b8ed62008 ext4: avoid unnecessary transaction starts during writeback
ext4_writepages() currently works in a loop like:
  start a transaction
  scan inode for pages to write
  map and submit these pages
  stop the transaction

This loop results in starting transaction once more than is needed
because in the last iteration we start a transaction only to scan the
inode and find there are no pages to write. This can be significant
increase in number of transaction starts for single-extent files or
files that have all blocks already mapped. Furthermore we already know
from previous iteration whether there are more pages to write or not. So
propagate the information from mpage_prepare_extent_to_map() and avoid
unnecessary looping in case there are no more pages to write.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:56 -04:00
Jens Axboe
6e014c621e ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
Running with some debug patches to detect illegal blocking triggered the
extend/unaligned condition in ext4. If ext4 needs to extend the file (and
hence go to buffered IO), or if the app is doing unaligned IO, then ext4
asks the iomap code to wait for IO completion. If the caller asked for
no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN
instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
ba98890393 ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
c7d216e8c4 fs: remove the access_ok() check in ioctl_fiemap
access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
45dd052e67 fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
handled once instead of duplicated, but can still be done under fs locks,
like xfs/iomap intended with its duplicate handling.  Also make sure the
error value of filemap_write_and_wait is propagated to user space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
cddf8a2c4a fs: move fiemap range validation into the file systems instances
Replace fiemap_check_flags with a fiemap_prep helper that also takes the
inode and mapped range, and performs the sanity check and truncation
previously done in fiemap_check_range.  This way the validation is inside
the file system itself and thus properly works for the stacked overlayfs
case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
2732881894 iomap: fix the iomap_fiemap prototype
iomap_fiemap should take u64 start and len arguments, just like the
->fiemap prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
10c5db2864 fs: move the fiemap definitions out of fs.h
No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
44ebcd06bb fs: mark __generic_block_fiemap static
There is no caller left outside of ioctl.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
da565e792b ext4: remove the call to fiemap_check_flags in ext4_fiemap
iomap_fiemap already calls fiemap_check_flags first thing, so this
additional check is redundant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
03a5ed24c9 ext4: split _ext4_fiemap
The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split
them into entirely separate functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
328e24ae14 ext4: fix fiemap size checks for bitmap files
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others.  This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Ritesh Harjani
175efa81fe ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Jonathan Grant
9f364e1d95 add comment for ext4_dir_entry_2 file_type member
Signed-off-by: Jonathan Grant <jg@jguk.org>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Jan Kara
14ff628630 jbd2: avoid leaking transaction credits when unreserving handle
When reserved transaction handle is unused, we subtract its reserved
credits in __jbd2_journal_unreserve_handle() called from
jbd2_journal_stop(). However this function forgets to remove reserved
credits from transaction->t_outstanding_credits and thus the transaction
space that was reserved remains effectively leaked. The leaked
transaction space can be quite significant in some cases and leads to
unnecessarily small transactions and thus reducing throughput of the
journalling machinery. E.g. fsmark workload creating lots of 4k files
was observed to have about 20% lower throughput due to this when ext4 is
mounted with dioread_nolock mount option.

Subtract reserved credits from t_outstanding_credits as well.

CC: stable@vger.kernel.org
Fixes: 8f7d89f368 ("jbd2: transaction reservation support")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Jan Kara
dfcd4489e2 ext4: drop ext4_journal_free_reserved()
Remove ext4_journal_free_reserved() function. It is never used.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
993778306e ext4: mballoc: use lock for checking free blocks while retrying
Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
8ef123fe02 ext4: mballoc: refactor ext4_mb_good_group()
ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
07b5b8e1ac ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for allocation logic (say for e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result into infinte looping since grp->bb_free is
   non-zero.

2. Also before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were lot of races where grp->bb_free
   could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
cf5e2ca6c9 ext4: mballoc: refactor ext4_mb_discard_preallocations()
Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
53f86b170d ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks
ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-

Process A  				Process B

1. allocate blocks
					1. Fails block allocation from
					     ext4_mb_regular_allocator()
   ext4_lock_group()
	allocated blocks
	more than ac_o_ex.fe_len
   ext4_unlock_group()
					2. Scans the
					   grp->bb_prealloc_list (under
					   ext4_lock_group()) and
					   find nothing and thus return
					   -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
   	add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

Above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time with block allocation when
ext4_lock_group() was still held.
With this discard-PA will know if there are actually any blocks which
could be freed from the PA

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Eric Biggers
de8ff14cab ext4: add casefold flag to EXT4_INODE_* flags
No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the
EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions.

Also make it clearer that the casefold flag is only for directories.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Eric Whitney
8ad8d71003 ext4: rework map struct instantiation in ext4_ext_map_blocks()
The path performing block allocations in ext4_ext_map_blocks() contains
code trimming the length of a new extent that is repeated later
in the function.  This code is both redundant and unnecessary as the
exact length of the new extent has already been calculated.  Rewrite the
instantiation of the map struct in this case to use the available
values, avoiding the overhead of unnecessary conversions and improving
clarity.  Add another map struct instantiation tailored specifically to
the separate case for an existing written extent.  Remove an old comment
that no longer appears applicable to the current code.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
70aa1554b0 ext4: make ext_debug() implementation to use pr_debug()
ext_debug() msgs could be helpful, provided those could be enabled
without recompiling kernel and also if we could selectively enable
only required prints for case by case debugging.

So make ext_debug() implementation use pr_debug().
Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG.
So EXT_DEBUG macro now mostly remain for below 3 functions.
ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug()
which again could be dynamically enabled using pr_debug())

This also changes the ext_debug() to take inode as a parameter
to add inode no. in all of it's msgs.
Prints additional info like process name / pid, superblock id etc.
This also removes any explicit function names passed in ext_debug().
Since ext_debug() on it's own prints file, func and line no.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
d3df14535f ext4: mballoc: make mb_debug() implementation to use pr_debug()
mb_debug() msg had only 1 control level for all type of msgs.
And if we enable mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug() with which we could have finer control to print msgs at all
of different levels (i.e. at file, func, line no.).

Also add process name/pid, superblk id, and other info in mb_debug()
msg. This also kills the mballoc_debug module parameter, since it is
not needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
8ec2d31b27 ext4: replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()
Replace EXT_DEBUG with __maybe_unused from inside
ext4_ext_handle_unwritten_extents() function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ae335b94506cd9db9d2648c1f4dd25a80f9f3ce2.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
ec8c60be96 ext4: improve ext_debug() msg in case of block allocation failure
ext4_map_blocks() has ext_debug msg early at the start of function.
We also get ext_debug msg if we could allocate a block from
ext4_ext_map_blocks(). But there is no ext_debug() msg in case of
block allocation failure. So add one along with error code.

Also add more info in ext_debug() msg like how many blocks were allocated
v/s how many were requested in ext4_ext_map_blocks().

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
6db0746189 ext4: use BIT() macro for BH_** state bits
Simply use BIT() macro for all BH_** state bits instead of open
coding it.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
9bee5779ee ext4: balloc: use task_pid_nr() helper
Use task_pid_nr() function instead of current->pid.
There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
eb2b8ebb86 ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK
Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.

Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if above happens.
In fact ext4_read_block_bitmap() may even return an error in case of resize
ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger
this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
a345021553 ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function
This patch implemets mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro
and it's related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
4fca8f0779 ext4: mballoc: make ext4_mb_use_preallocated() return type as bool
Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
f283529aba ext4: mballoc: simplify error handling in ext4_init_mballoc()
This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
004379d0b0 ext4: mballoc: fix few other format specifier in mb_debug()
Fix few other format specifiers in mb_debug() msgs.
As such no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
36bad4233c ext4: mballoc: correct the mb_debug() format specifier for pa_len var
pa->pa_len is an integer. Fix all of the format specifier used in
mb_debug() for pa_len to %d instead of %u.

As such no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
bbc4ec77e9 ext4: mballoc: add more mb_debug() msgs
This patch adds some more debugging mb_debug() msgs to help improve
mballoc code debugging.
Other than adding more mb_debug() msgs at few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
e68cf40c0d ext4: mballoc: refactor ext4_mb_show_ac()
This factors out ext4_mb_show_pa() function to show all the group's
preallocation info. This could be useful info to be added in later
patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
212da3ec6f ext4: mballoc: print bb_free info even when it is 0
Improve the debugging msg by also printing even if bb_free is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Theodore Ts'o
73c384c0cd ext4: avoid ext4_error()'s caused by ENOMEM in the truncate path
We can't fail in the truncate path without requiring an fsck.
Add work around for this by using a combination of retry loops
and the __GFP_NOFAIL flag.

From: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Eric Biggers
08adf452e6 ext4: fix race between ext4_sync_parent() and rename()
'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
broken because without d_lock, d_parent can be concurrently changed due
to a rename().  Then if the old directory is immediately deleted, old
d_parent->inode can be NULL.  That causes a NULL dereference in igrab().

To fix this, use dget_parent() to safely grab a reference to the parent
dentry, which pins the inode.  This also eliminates the need to use
d_find_any_alias() other than for the initial inode, as we no longer
throw away the dentry at each step.

This is an extremely hard race to hit, but it is possible.  Adding a
udelay() in between the reads of ->d_parent and its ->d_inode makes it
reproducible on a no-journal filesystem using the following program:

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        if (fork()) {
            for (;;) {
                mkdir("dir1", 0700);
                int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC);
                write(fd, "X", 1);
                close(fd);
            }
        } else {
            mkdir("dir2", 0700);
            for (;;) {
                rename("dir1/file", "dir2/file");
                rmdir("dir1");
            }
        }
    }

Fixes: d59729f4e7 ("ext4: fix races in ext4_sync_parent()")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Christophe JAILLET
80dd4978dd ext4: fix a typo in a comment
s/extnets/extents/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20200503200647.154701-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
be809e1274 ext4: clean up ext4_ext_convert_to_initialized() error handling
If ext4_ext_convert_to_initialized() fails when called within
ext4_ext_handle_unwritten_extents(), immediately error out through the
exit point at function end.  Fix the error handling in the event
ext4_ext_convert_to_initialized() returns 0, which it shouldn't do when
converting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-5-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
779e26517b ext4: clean up GET_BLOCKS_PRE_IO error handling
If the call to ext4_split_convert_extents() fails in the
EXT4_GET_BLOCKS_PRE_IO case within ext4_ext_handle_unwritten_extents(),
error out through the exit point at function end rather than jumping
through an intermediate point.  Fix the error handling in the event
ext4_split_convert_extents() returns 0, which it shouldn't do when
splitting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-4-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
bee6cf00c7 ext4: remove redundant GET_BLOCKS_CONVERT code
Remove the redundant code assigning values to ext4_map_blocks components
in ext4_ext_handle_unwritten_extents() for the EXT4_GET_BLOCKS_CONVERT
case, using the code at the function exit instead.  Clean up and reorder
that code to eliminate more redundancy and improve readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
ee802f8689 ext4: remove dead GET_BLOCKS_ZERO code
There's no call to ext4_map_blocks() in the current ext4 code with a
flags argument that combines EXT4_GET_BLOCKS_CONVERT and
EXT4_GET_BLOCKS_ZERO.  Remove the code that corresponds to this case
from ext4_ext_handle_unwritten_extents().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200430185320.23001-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Harshad Shirwadkar
b60ca3343e ext4: don't ignore return values from ext4_ext_dirty()
Don't ignore return values from ext4_ext_dirty, since the errors
indicate valid failures below Ext4.  In all of the other instances of
ext4_ext_dirty calls, the error return value is handled in some
way. This patch makes those remaining couple of places to handle
ext4_ext_dirty errors as well. In case of ext4_split_extent_at(), the
ignorance of return value is intentional. The reason is that we are
already in error path and there isn't much we can do if ext4_ext_dirty
returns error. This patch adds a comment for that case explaining why
we ignore the return value.

In the longer run, we probably should
make sure that errors from other mark_dirty routines are handled as
well.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200427013438.219117-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Harshad Shirwadkar
4209ae12b1 ext4: handle ext4_mark_inode_dirty errors
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Jeffle Xu
8418897f1b ext4: fix error pointer dereference
Don't pass error pointers to brelse().

commit 7159a986b4 ("ext4: fix some error pointer dereferences") has fixed
some cases, fix the remaining one case.

Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is
stored in @bs->bh, which will be passed to brelse() in the cleanup
routine of ext4_xattr_set_handle(). This will then cause a NULL panic
crash in __brelse().

BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
RIP: 0010:__brelse+0x1b/0x50
Call Trace:
 ext4_xattr_set_handle+0x163/0x5d0
 ext4_xattr_set+0x95/0x110
 __vfs_setxattr+0x6b/0x80
 __vfs_setxattr_noperm+0x68/0x1b0
 vfs_setxattr+0xa0/0xb0
 setxattr+0x12c/0x1a0
 path_setxattr+0x8d/0xc0
 __x64_sys_setxattr+0x27/0x30
 do_syscall_64+0x60/0x250
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this case, @bs->bh stores '-EIO' actually.

Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org # 2.6.19
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Jan Kara
ceff86fdda ext4: Avoid freeing inodes on dirty list
When we are evicting inode with journalled data, we may race with
transaction commit in the following way:

CPU0					CPU1
jbd2_journal_commit_transaction()	evict(inode)
					  inode_io_list_del()
					  inode_wait_for_writeback()
  process BJ_Forget list
    __jbd2_journal_insert_checkpoint()
    __jbd2_journal_refile_buffer()
      __jbd2_journal_unfile_buffer()
        if (test_clear_buffer_jbddirty(bh))
          mark_buffer_dirty(bh)
	    __mark_inode_dirty(inode)
					  ext4_evict_inode(inode)
					    frees the inode

This results in use-after-free issues in the writeback code (or
the assertion added in the previous commit triggering).

Fix the problem by removing inode from writeback lists once all the page
cache is evicted and so inode cannot be added to writeback lists again.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Jan Kara
4301efa4c7 writeback: Export inode_io_list_del()
Ext4 needs to remove inode from writeback lists after it is out of
visibility of its journalling machinery (which can still dirty the
inode). Export inode_io_list_del() for it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Xiyu Yang
3bbd0ef260 ext4: fix buffer_head refcnt leak when ext4_iget() fails
ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
reference of the specified buffer_head object to "bitmap_bh" with
increased refcnt.

When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
ext4_orphan_get(). When ext4_iget() fails, the function forgets to
decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
refcnt leak.

Fix this issue by calling brelse() when ext4_iget() fails.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Harshad Shirwadkar
c36a71b4e3 ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
(-1) resulting in illegal memory accesses. Although there is no
consistent repro, we see that generic/019 sometimes crashes because of
this bug.

Ran gce-xfstests smoke and verified that there were no regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-06-03 23:16:49 -04:00
Jason Yan
39c0ae163f ext4: remove unnecessary comparisons to bool
Fix the following coccicheck warning:

fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Eric Whitney
9e52484c71 ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag
The eofblocks code was removed in the 5.7 release by "ext4: remove
EOFBLOCKS_FL and associated code" (4337ecd1fe).  The ext4_map_blocks()
flag used to trigger it can now be removed as well.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:48 -04:00
Carlos Guerrero Álvarez
6b6aeffc93 ext4: fix a style issue in fs/ext4/acl.c
Fixed an if statement where braces were not needed.

Link: https://lore.kernel.org/r/20200416141456.1089-1-carlosteniswarrior@gmail.com
Signed-off-by: Carlos Guerrero Álvarez <carlosteniswarrior@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03 23:16:48 -04:00
Shijie Hu
8859025315 hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
In a 32-bit program, running on arm64 architecture.  When the address
space below mmap base is completely exhausted, shmat() for huge pages will
return ENOMEM, but shmat() for normal pages can still success on no-legacy
mode.  This seems not fair.

For normal pages, the calling trace of get_unmapped_area() is:

	=> mm->get_unmapped_area()
	if on legacy mode,
		=> arch_get_unmapped_area()
			=> vm_unmapped_area()
	if on no-legacy mode,
		=> arch_get_unmapped_area_topdown()
			=> vm_unmapped_area()

For huge pages, the calling trace of get_unmapped_area() is:

	=> file->f_op->get_unmapped_area()
		=> hugetlb_get_unmapped_area()
			=> vm_unmapped_area()

To solve this issue, we only need to make hugetlb_get_unmapped_area() take
the same way as mm->get_unmapped_area().  Add *bottomup() and *topdown()
for hugetlbfs, and check current mm->get_unmapped_area() to decide which
one to use.  If mm->get_unmapped_area is equal to
arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls
topdown routine, otherwise calls bottomup routine.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shijie Hu <hushijie3@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Chen Jie <chenjie6@huawei.com>
Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:49 -07:00
Johannes Weiner
6058eaec81 mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
They're the same function, and for the purpose of all callers they are
equivalent to lru_cache_add().

[akpm@linux-foundation.org: fix it for local_lock changes]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds
ae03c53d00 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "Christoph's assorted splice cleanups"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: rename pipe_buf ->steal to ->try_steal
  fs: make the pipe_buf_operations ->confirm operation optional
  fs: make the pipe_buf_operations ->steal operation optional
  trace: remove tracing_pipe_buf_ops
  pipe: merge anon_pipe_buf*_ops
  fs: simplify do_splice_from
  fs: simplify do_splice_to
2020-06-03 15:52:19 -07:00
Al Viro
8861fd576e binfmt_flat: don't use __put_user()
... and check the return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:58:53 -04:00
Al Viro
0abb013e2e binfmt_elf_fdpic: don't use __... uaccess primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:58:11 -04:00
Al Viro
646e84deb4 binfmt_elf: don't bother with __{put,copy_to}_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:56:47 -04:00
Linus Torvalds
e7c93cbfe9 threads-v5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc
 oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL
 WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8=
 =ZWj3
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
 "We have been discussing using pidfds to attach to namespaces for quite
  a while and the patches have in one form or another already existed
  for about a year. But I wanted to wait to see how the general api
  would be received and adopted.

  This contains the changes to make it possible to use pidfds to attach
  to the namespaces of a process, i.e. they can be passed as the first
  argument to the setns() syscall.

  When only a single namespace type is specified the semantics are
  equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
  equals setns(pidfd, CLONE_NEWNET).

  However, when a pidfd is passed, multiple namespace flags can be
  specified in the second setns() argument and setns() will attach the
  caller to all the specified namespaces all at once or to none of them.

  Specifying 0 is not valid together with a pidfd. Here are just two
  obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

  Allowing to also attach subsets of namespaces supports various
  use-cases where callers setns to a subset of namespaces to retain
  privilege, perform an action and then re-attach another subset of
  namespaces.

  Apart from significantly reducing the number of syscalls needed to
  attach to all currently supported namespaces (eight "open+setns"
  sequences vs just a single "setns()"), this also allows atomic setns
  to a set of namespaces, i.e. either attaching to all namespaces
  succeeds or we fail without having changed anything.

  This is centered around a new internal struct nsset which holds all
  information necessary for a task to switch to a new set of namespaces
  atomically. Fwiw, with this change a pidfd becomes the only token
  needed to interact with a container. I'm expecting this to be
  picked-up by util-linux for nsenter rather soon.

  Associated with this change is a shiny new test-suite dedicated to
  setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
2020-06-03 13:12:57 -07:00
Chengguang Xu
1434a65ea6 ovl: drop negative dentry in upper layer
Negative dentries of upper layer are useless after construction of
overlayfs' own dentry and may keep in the memory long time even after
unmount of overlayfs instance. This patch tries to drop unnecessary
negative dentry of upper layer to effectively reclaim memory.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:22 +02:00
Miklos Szeredi
05acefb487 ovl: check permission to open real file
Call inode_permission() on real inode before opening regular file on one of
the underlying layers.

In some cases ovl_permission() already checks access to an underlying file,
but it misses the metacopy case, and possibly other ones as well.

Removing the redundant permission check from ovl_permission() should be
considered later.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:22 +02:00
Miklos Szeredi
292f902a40 ovl: call secutiry hook in ovl_real_ioctl()
Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do
it.

[Stephen Rothwell] export security_file_ioctl

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:18 +02:00
Linus Torvalds
d6f9469a03 Changes since last update:
- Convert to use the new mount apis;
 
  - Some random cleanup patches.
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXtbfOhUcaHNpYW5na2Fv
 QHJlZGhhdC5jb20ACgkQOTcx3B+15gTvZgD6Ap8mYxRaW7Qta+HEyFuyRrxWZ/XZ
 pq/hYiouGosDdaMBAOUNl8pGlPX54T+Y9VZv0wV0Dp4pan6NApdgtL9fIQUE
 =QhQh
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "The most interesting part is the new mount api conversion, which is
  actually a old patch already pending for several cycles. And the
  others are recent trivial cleanups here.

  Summary:

   - Convert to use the new mount apis

   - Some random cleanup patches"

* tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: suppress false positive last_block warning
  erofs: convert to use the new mount fs_context api
  erofs: code cleanup by removing ifdef macro surrounding
2020-06-02 20:16:55 -07:00
Linus Torvalds
cadf32234b Replace zero-length array in JFS
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAl7WbbAACgkQNqiEXrVA
 jGRBvg//aff/q785MyIKlv2icnWJTFIfkFhdKBfD36OI5Obur3ko17gJiRQ7BvIl
 07yBTOpcqYyq4pThFXA8TfiZyRxKv0faHeFYmtmEum6gZNtHPmpvfy1JFr12vjHZ
 SWSpylW6nprQJa61KwGHcRqTGuWe0sco6BBAQ/E015wJLKrtIpNCH5H3x3eDv8we
 nBPSHSu1BsJFTcf94ECZhjbsiLgLgTaYVFUU8tw7eTp7EkZz9xq6wYmhZ6e5Z6Ki
 AGbIQ/c16Nk5B3/8A1bU0pm7BHz9Zx1ucNy4KZqyFl2fWEmQiv1ofa3cfLI+ZAF2
 /TFzLW+rpvesc+Byk7PgzRrLNXAMBk+botzdlN8JPaPDFYcm6whnRKstmjYGAzCI
 /bpTI9wyEPzlfWtt5h2ws1KaP5HNl1OLd3sJCzfwH+nKaeJpdBBSaZRQ0yZWIsg5
 hzt9WnJ10351sbLbAA4ybsOlWJvhPtSo8jSiGUGzu7aFyfa7H82nslWHuFjeI3F0
 qgcA5j6LJ8qAviMnhxX5jrqIMNZzyf+P5/oJvXjfSeWft1yD6EUqRqb7i0FREjh6
 qzxiX6xhRSh8AIr3XlToG3W7KULQkBJKwZumSTQ2X03yJRhelqKAs3LvvyXoZuG3
 oFvGgiGNph0VSTTPdOSrZZ3tOCDiONXfHHOFpghk4qKdWgFoWtg=
 =yNwD
 -----END PGP SIGNATURE-----

Merge tag 'jfs-5.8' of git://github.com/kleikamp/linux-shaggy

Pull JFS update from David Kleikamp:
 "Replace zero-length array in JFS"

* tag 'jfs-5.8' of git://github.com/kleikamp/linux-shaggy:
  jfs: Replace zero-length array with flexible-array member
2020-06-02 20:11:35 -07:00
Linus Torvalds
f3cdc8ae11 for-5.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
 WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
 ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
 V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
 o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
 ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
 NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
 UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
 ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
 miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
 VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
 eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
 =qTt8
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Highlights:

   - speedup dead root detection during orphan cleanup, eg. when there
     are many deleted subvolumes waiting to be cleaned, the trees are
     now looked up in radix tree instead of a O(N^2) search

   - snapshot creation with inherited qgroup will mark the qgroup
     inconsistent, requires a rescan

   - send will emit file capabilities after chown, this produces a
     stream that does not need postprocessing to set the capabilities
     again

   - direct io ported to iomap infrastructure, cleaned up and simplified
     code, notably removing last use of struct buffer_head in btrfs code

  Core changes:

   - factor out backreference iteration, to be used by ordinary
     backreferences and relocation code

   - improved global block reserve utilization
      * better logic to serialize requests
      * increased maximum available for unlink
      * improved handling on large pages (64K)

   - direct io cleanups and fixes
      * simplify layering, where cloned bios were unnecessarily created
        for some cases
      * error handling fixes (submit, endio)
      * remove repair worker thread, used to avoid deadlocks during
        repair

   - refactored block group reading code, preparatory work for new type
     of block group storage that should improve mount time on large
     filesystems

  Cleanups:

   - cleaned up (and slightly sped up) set/get helpers for metadata data
     structure members

   - root bit REF_COWS got renamed to SHAREABLE to reflect the that the
     blocks of the tree get shared either among subvolumes or with the
     relocation trees

  Fixes:

   - when subvolume deletion fails due to ENOSPC, the filesystem is not
     turned read-only

   - device scan deals with devices from other filesystems that changed
     ownership due to overwrite (mkfs)

   - fix a race between scrub and block group removal/allocation

   - fix long standing bug of a runaway balance operation, printing the
     same line to the syslog, caused by a stale status bit on a reloc
     tree that prevented progress

   - fix corrupt log due to concurrent fsync of inodes with shared
     extents

   - fix space underflow for NODATACOW and buffered writes when it for
     some reason needs to fallback to COW mode"

* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
  btrfs: fix space_info bytes_may_use underflow during space cache writeout
  btrfs: fix space_info bytes_may_use underflow after nocow buffered write
  btrfs: fix wrong file range cleanup after an error filling dealloc range
  btrfs: remove redundant local variable in read_block_for_search
  btrfs: open code key_search
  btrfs: split btrfs_direct_IO to read and write part
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove dio_end_io()
  btrfs: switch to iomap_dio_rw() for dio
  iomap: remove lockdep_assert_held()
  iomap: add a filesystem hook for direct I/O bio submission
  fs: export generic_file_buffered_read()
  btrfs: turn space cache writeout failure messages into debug messages
  btrfs: include error on messages about failure to write space/inode caches
  btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
  btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
  btrfs: make checksum item extension more efficient
  btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
  btrfs: unexport btrfs_compress_set_level()
  btrfs: simplify iget helpers
  ...
2020-06-02 19:59:25 -07:00
Linus Torvalds
8eeae5bae1 (More) new code for 5.8:
- Introduce DONTCACHE flags for dentries and inodes.  This hint will
   cause the VFS to drop the associated objects immediately after the
   last put, so that we can change the file access mode (DAX or page
   cache) on the fly.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl68FowACgkQ+H93GTRK
 tOtNzA/9FkXXQYAlTWK/toHfJV8DQT/Kx1fvf8Ng0EphBUQa/rNzlcMzFg7Gw5Cs
 Rzis96+xj4q//iseLZN5LLxaoxqT2Qipza0GWCMJpQG/4wTWM0Ar7BnG/Vc87lUV
 F0mXnILZOUMFzr8Zj9q4ka6UGRTDSXXtwNXqBuPpIZyVbMQvPtXHhM3lWV5RUQwm
 fznBxDAEGoVXiyID2OrZD5tS4BMd16uFWAWLjWphpcy18zfC7zp0+0MQik4v/9oi
 54pZdtPT9/dQOu/BI8tfLP45XzZ6f++gXy2p/G96dy7ism1u40ML77ojEkadVVFe
 Bf7t+EswNxrx/em/ugWbcJDtrxttSqU47g2AXsbJJB2+aHCih6Cfid41lMyRvlhR
 d4cumoteX7IF/PpT3YaKHWQBo5OxHK0a2CBPd6czrCBw5yXrEUagdmw1XQ//bw5e
 FRCg4eMcEW0UgINvBCHWdWRx6VaL8ngMMsflVJ/lY7FeVvM10ZYRFzJoryoebSPm
 /yWcoHFsTPC8K0nWVmbwPazVE19I0g4y6Wiw39YvZDzZRzM9PcQI4DBxQcab+Va/
 FPfXEXkpz0GiC6zjs/QfkPtg60GI1IG5Um4JUzdv6ce1P0p1rGcu5WiNYearahE7
 7V/44WGIEAd4NP7R0JPTI0Fqv7v6uuDzMoCp7YDn8gE4FCJTt6M=
 =ebl3
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull DAX updates part two from Darrick Wong:
 "This time around, we're hoisting the DONTCACHE flag from XFS into the
  VFS so that we can make the incore DAX mode changes become effective
  sooner.

  We can't change the file data access mode on a live inode because we
  don't have a safe way to change the file ops pointers. The incore
  state change becomes effective at inode loading time, which can happen
  if the inode is evicted. Therefore, we're making it so that
  filesystems can ask the VFS to evict the inode as soon as the last
  holder drops.

  The per-fs changes to make this call this will be in subsequent pull
  requests from Ted and myself.

  Summary:

   - Introduce DONTCACHE flags for dentries and inodes. This hint will
     cause the VFS to drop the associated objects immediately after the
     last put, so that we can change the file access mode (DAX or page
     cache) on the fly"

* tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: Introduce DCACHE_DONTCACHE
  fs: Lift XFS_IDONTCACHE to the VFS layer
2020-06-02 19:48:41 -07:00
Linus Torvalds
96ed320d52 New code for 5.8:
- Clean up io_is_direct.
 - Add a new statx flag to indicate when file data access is being done
   via DAX (as opposed to the page cache).
 - Update the documentation for how system administrators and application
   programmers can take advantage of the (still experimental DAX) feature.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6wO7UACgkQ+H93GTRK
 tOtEaw//eShC2YE0S+GS7ihQ3x71PJa4Is0VZOIpTHl01aqMSegwB3QbDQbVUhn1
 TzLhUw4pZsz3R9GbUOrfHOYRt+aSP2t0WNhIulDeBp41CYJQSaFt85KnfM9hoBOi
 VYssum3Lu7/6ReKrDD/mumzWYkts+JDCuXRmt7nOQeZJVNXOCBBbvN354V4/IKLY
 wB4Wnaq3f3gYniXYW/23aCX+kocaOIUZtK6aFKyeD0KvfP5toDlpw1cBVMoM9CmO
 bmEy8vKf4lgFZDLeDMqmWOecMgEH5h0baN5Psu13WuDCiCd6maBl0KpxVpVlwsep
 yVz6mMbZjmLOJ2lqyw+lZb+XicD+K3yRVSTGKxV3VbuRjeX9tjVG5Im13VesNvJB
 WWJq/CkOU8W0Zs7Q5RbUDGbFFWDJSI/OStAU+UeuWvL9Gndv7hqv6H904qbPPtEu
 4m4Y34ARzrEaKpkABKKwQ53cLClNxmmgUN9N3cXK3mk8idlX4zM3j6+HJYUxXTO+
 fBjhOlyUy2KaWmzZoJp28QvaU4iegGmMSuRnQ9HAvXmdxUA2K6+wjS6LCZGh04vz
 z7SbzTBlo2kvsKdRMwJ306s2QA0/HvmKHHLI+p8OQANce9hjhE3XdJayzhitd0fk
 k0D/y8OY+fbCSgI8C4g66lA8Zf2sos/ulD0QTTNBPfU2rRWkKUc=
 =rS19
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull DAX updates part one from Darrick Wong:
 "After many years of LKML-wrangling about how to enable programs to
  query and influence the file data access mode (DAX) when a filesystem
  resides on storage devices such as persistent memory, Ira Weiny has
  emerged with a proposed set of standard behaviors that has not been
  shot down by anyone! We're more or less standardizing on the current
  XFS behavior and adapting ext4 to do the same.

  This is the first of a handful pull requests that will make ext4 and
  XFS present a consistent interface for user programs that care about
  DAX. We add a statx attribute that programs can check to see if DAX is
  enabled on a particular file. Then, we update the DAX documentation to
  spell out the user-visible behaviors that filesystems will guarantee
  (until the next storage industry shakeup). The on-disk inode flag has
  been in XFS for a few years now.

  Summary:

   - Clean up io_is_direct.

   - Add a new statx flag to indicate when file data access is being
     done via DAX (as opposed to the page cache).

   - Update the documentation for how system administrators and
     application programmers can take advantage of the (still
     experimental DAX) feature"

Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/

* tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  Documentation/dax: Update Usage section
  fs/stat: Define DAX statx attribute
  fs: Remove unneeded IS_DAX() check in io_is_direct()
2020-06-02 19:45:12 -07:00
Linus Torvalds
16d91548d1 New code for 5.8:
- Various cleanups to remove dead code, unnecessary conditionals,
       asserts, etc.
     - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
       redundantly.
     - Tighten up our dmesg logging to ensure that everything is prefixed
       with 'XFS' for easier grepping.
     - Kill a bunch of typedefs.
     - Refactor the deferred ops code to reduce indirect function calls.
     - Increase type-safety with the deferred ops code.
     - Make the DAX mount options a tri-state.
     - Fix some error handling problems in the inode flush code and clean up
       other inode flush warts.
     - Refactor log recovery so that each log item recovery functions now live
       with the other log item processing code.
     - Fix some SPDX forms.
     - Fix quota counter corruption if the fs crashes after running
       quotacheck but before any dquots get logged.
     - Don't fail metadata verification on zero-entry attr leaf blocks, since
       they're just part of the disk format now due to a historic lack of log
       atomicity.
     - Don't allow SWAPEXT between files with different [ugp]id when quotas
       are enabled.
     - Refactor inode fork reading and verification to run directly from the
       inode-from-disk function.  This means that we now actually guarantee
       that _iget'ted inodes are totally verified and ready to go.
     - Move the incore inode fork format and extent counts to the ifork
       structure.
     - Scalability improvements by reducing cacheline pingponging in
       struct xfs_mount.
     - More scalability improvements by removing m_active_trans from the
       hot path.
     - Fix inode counter update sanity checking to run /only/ on debug
       kernels.
     - Fix longstanding inconsistency in what error code we return when a
       program hits project quota limits (ENOSPC).
     - Fix group quota returning the wrong error code when a program hits
       group quota limits.
     - Fix per-type quota limits and grace periods for group and project
       quotas so that they actually work.
     - Allow extension of individual grace periods.
     - Refactor the non-reclaim inode radix tree walking code to remove a
       bunch of stupid little functions and straighten out the
       inconsistent naming schemes.
     - Fix a bug in speculative preallocation where we measured a new
       allocation based on the last extent mapping in the file instead of
       looking farther for the last contiguous space allocation.
     - Force delalloc writes to unwritten extents.  This closes a
       stale disk contents exposure vector if the system goes down before
       the write completes.
     - More lockdep whackamole.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7OjhgACgkQ+H93GTRK
 tOuGeBAApuP9ohtvrJT9FW7U+OrRsK3lw/3R+MEYpJu8GKLpGbJ6j+SKrTHxxLvu
 Rp63YLIlHBOz2rNa4brm/wW8gGJIGXOnGpuiGq0Irl01xEmwqmjOLfLcYkYhno1E
 i+rG0PiKYZeo/xhLtTKGl+NAwHHxmbOmxUtYHnbinHtPzDyYLQ0wff+oUkmQ7ydg
 bMYFMXohoJ3Pc5UjmUrCuJj1cvYOUwl0P4LGKiq5Zud61AkBCSskEpk+oo5xFcEX
 JJc1xkn5MPi+oGpSYqhnSZ6aSjwp53/i44O9volp5vCRXXv1eLVni2u/ScZ85L72
 HXxoDyuZOUupirIfMBQFHsazDGPGyFIqtPhGlXoTJjrwX+ymimY6CU/0e+Xu9DEu
 krlxajfUssH30zyG2q/2TaxslU35CROH6hVBXFe0Y5cEEsOIf2aOpErUhhw2YyS7
 onN9gb2NBBQdYtHqIMwsbhcgq60g5H6JfGriB5dJimXXLmpuTfAREGCY2AqIoB1x
 +8QFod0WwsMn6FYhi/UpZjC9qp/WTvojBUEt8Ci3ketUFwO1CLf9qm6Hj71RL3fs
 fCEDHx/ZMMft7Bdbf36lICoMAhF/KfNcRn1PsQdpW4LY1Aml/7qjFNZthSVRDW+E
 rhzNu+RIzGEQsSemBvccRaaTP3HFqN+qPATu2K0sALaa1LRFxzQ=
 =/NYc
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Most of the changes this cycle are refactoring of existing code in
  preparation for things landing in the future.

  We also fixed various problems and deficiencies in the quota
  implementation, and (I hope) the last of the stale read vectors by
  forcing write allocations to go through the unwritten state until the
  write completes.

  Summary:

   - Various cleanups to remove dead code, unnecessary conditionals,
     asserts, etc.

   - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
     redundantly.

   - Tighten up our dmesg logging to ensure that everything is prefixed
     with 'XFS' for easier grepping.

   - Kill a bunch of typedefs.

   - Refactor the deferred ops code to reduce indirect function calls.

   - Increase type-safety with the deferred ops code.

   - Make the DAX mount options a tri-state.

   - Fix some error handling problems in the inode flush code and clean
     up other inode flush warts.

   - Refactor log recovery so that each log item recovery functions now
     live with the other log item processing code.

   - Fix some SPDX forms.

   - Fix quota counter corruption if the fs crashes after running
     quotacheck but before any dquots get logged.

   - Don't fail metadata verification on zero-entry attr leaf blocks,
     since they're just part of the disk format now due to a historic
     lack of log atomicity.

   - Don't allow SWAPEXT between files with different [ugp]id when
     quotas are enabled.

   - Refactor inode fork reading and verification to run directly from
     the inode-from-disk function. This means that we now actually
     guarantee that _iget'ted inodes are totally verified and ready to
     go.

   - Move the incore inode fork format and extent counts to the ifork
     structure.

   - Scalability improvements by reducing cacheline pingponging in
     struct xfs_mount.

   - More scalability improvements by removing m_active_trans from the
     hot path.

   - Fix inode counter update sanity checking to run /only/ on debug
     kernels.

   - Fix longstanding inconsistency in what error code we return when a
     program hits project quota limits (ENOSPC).

   - Fix group quota returning the wrong error code when a program hits
     group quota limits.

   - Fix per-type quota limits and grace periods for group and project
     quotas so that they actually work.

   - Allow extension of individual grace periods.

   - Refactor the non-reclaim inode radix tree walking code to remove a
     bunch of stupid little functions and straighten out the
     inconsistent naming schemes.

   - Fix a bug in speculative preallocation where we measured a new
     allocation based on the last extent mapping in the file instead of
     looking farther for the last contiguous space allocation.

   - Force delalloc writes to unwritten extents. This closes a stale
     disk contents exposure vector if the system goes down before the
     write completes.

   - More lockdep whackamole"

* tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (129 commits)
  xfs: more lockdep whackamole with kmem_alloc*
  xfs: force writes to delalloc regions to unwritten
  xfs: refactor xfs_iomap_prealloc_size
  xfs: measure all contiguous previous extents for prealloc size
  xfs: don't fail unwritten extent conversion on writeback due to edquot
  xfs: rearrange xfs_inode_walk_ag parameters
  xfs: straighten out all the naming around incore inode tree walks
  xfs: move xfs_inode_ag_iterator to be closer to the perag walking code
  xfs: use bool for done in xfs_inode_ag_walk
  xfs: fix inode ag walk predicate function return values
  xfs: refactor eofb matching into a single helper
  xfs: remove __xfs_icache_free_eofblocks
  xfs: remove flags argument from xfs_inode_ag_walk
  xfs: remove xfs_inode_ag_iterator_flags
  xfs: remove unused xfs_inode_ag_iterator function
  xfs: replace open-coded XFS_ICI_NO_TAG
  xfs: move eofblocks conversion function to xfs_ioctl.c
  xfs: allow individual quota grace period extension
  xfs: per-type quota timers and warn limits
  xfs: switch xfs_get_defquota to take explicit type
  ...
2020-06-02 19:21:40 -07:00
Linus Torvalds
1ee08de1e2 for-5.8/io_uring-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VP+kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuK4D/0XsSG/Yirbba1rrbqw/qpw9xcAs9oyN0tS
 8SmmGN27ghrkVSsGBXNcG+PSTu3pkkLjYZ6TQtKamrya9G+lRAsKRsQ+Yq+7Qv4e
 N6lCUlLJ99KqTMtwvIoxSpA1tz3ENHucOw2cJrw3kd9G0kil7GvDkIOBasd+kmwn
 ak+mnMJZzRhqSM7M5lKQOk8l92gKBHGbPy4xKb0st3dQkYptDvit0KcNSAuevtOp
 sRZpdbXaT3FA6xa5iEgggI6vZQGVmK1EaGoQqZ8vgVo75aovkjZyQWWiFVVOlEqr
 QjUCCQuixcbMRbZjgpojqva5nmLhFVhLCfoSH2XgttEQZhmTwypdRwM2/IlxV5q2
 xCofrDkhYOfIgHkuP6p68ukIPIfQ+4jotvsmXZ/HeD/xbx3TRyJRZadISr6wiuLm
 7zRXWaGCYomUIPJOOrpBQ9FsCglkaN63oB6VGuGKTg3g7kE2QrZ2/aGuexP+FAdh
 OrA8BlzxZzpqMKhjQVKOl9r6FU928MZn8nIAkMdQ/Ia1mOpb4rrPo4qCdf+tbhPO
 pmKtQPQjbszQ3UfTgShvfvDk43BeRim1DxZPFTauSu1FMpqWBCwQgXMynPFrf5TR
 HXF61G+jw5swDW6uJgW7bXdm7hHr15vRqQr54MgGS+T0OOa1df9MR0dJB5CGklfI
 ycLU6AAT+A==
 =A/qA
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A relatively quiet round, mostly just fixes and code improvements. In
particular:

   - Make statx just use the generic statx handler, instead of open
     coding it. We don't need that anymore, as we always call it async
     safe (Bijan)

   - Enable closing of the ring itself. Also fixes O_PATH closure (me)

   - Properly name completion members (me)

   - Batch reap of dead file registrations (me)

   - Allow IORING_OP_POLL with double waitqueues (me)

   - Add tee(2) support (Pavel)

   - Remove double off read (Pavel)

   - Fix overflow cancellations (Pavel)

   - Improve CQ timeouts (Pavel)

   - Async defer drain fixes (Pavel)

   - Add support for enabling/disabling notifications on a registered
     eventfd (Stefano)

   - Remove dead state parameter (Xiaoguang)

   - Disable SQPOLL submit on dying ctx (Xiaoguang)

   - Various code cleanups"

* tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
  io_uring: fix overflowed reqs cancellation
  io_uring: off timeouts based only on completions
  io_uring: move timeouts flushing to a helper
  statx: hide interfaces no longer used by io_uring
  io_uring: call statx directly
  statx: allow system call to be invoked from io_uring
  io_uring: add io_statx structure
  io_uring: get rid of manual punting in io_close
  io_uring: separate DRAIN flushing into a cold path
  io_uring: don't re-read sqe->off in timeout_prep()
  io_uring: simplify io_timeout locking
  io_uring: fix flush req->refs underflow
  io_uring: don't submit sqes when ctx->refs is dying
  io_uring: async task poll trigger cleanup
  io_uring: add tee(2) support
  splice: export do_tee()
  io_uring: don't repeat valid flag list
  io_uring: rename io_file_put()
  io_uring: remove req->needs_fixed_files
  io_uring: cleanup io_poll_remove_one() logic
  ...
2020-06-02 15:42:50 -07:00
Linus Torvalds
bce159d734 for-5.8/drivers-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
 JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
 /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
 +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
 KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
 FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
 xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
 UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
 Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
 QcTxsXXXIw==
 =H+/Z
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, icluding the
          nvme part of the lpfc driver"

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Linus Torvalds
750a02ab8d for-5.8/block-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW
 Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk
 AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM
 FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO
 SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh
 gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA
 rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg
 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6
 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ
 y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR
 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ
 rCS8c+T7Ig==
 =HYf4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Core block changes that have been queued up for this release:

   - Remove dead blk-throttle and blk-wbt code (Guoqing)

   - Include pid in blktrace note traces (Jan)

   - Don't spew I/O errors on wouldblock termination (me)

   - Zone append addition (Johannes, Keith, Damien)

   - IO accounting improvements (Konstantin, Christoph)

   - blk-mq hardware map update improvements (Ming)

   - Scheduler dispatch improvement (Salman)

   - Inline block encryption support (Satya)

   - Request map fixes and improvements (Weiping)

   - blk-iocost tweaks (Tejun)

   - Fix for timeout failing with error injection (Keith)

   - Queue re-run fixes (Douglas)

   - CPU hotplug improvements (Christoph)

   - Queue entry/exit improvements (Christoph)

   - Move DMA drain handling to the few drivers that use it (Christoph)

   - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02 15:29:19 -07:00
Miklos Szeredi
56230d9567 ovl: verify permissions in ovl_path_open()
Check permission before opening a real file.

ovl_path_open() is used by readdir and copy-up routines.

ovl_permission() theoretically already checked copy up permissions, but it
doesn't hurt to re-do these checks during the actual copy-up.

For directory reading ovl_permission() only checks access to topmost
underlying layer.  Readdir on a merged directory accesses layers below the
topmost one as well.  Permission wasn't checked for these layers.

Note: modifying ovl_permission() to perform this check would be far more
complex and hence more bug prone.  The result is less precise permissions
returned in access(2).  If this turns out to be an issue, we can revisit
this bug.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:26 +02:00
Miklos Szeredi
48bd024b8a ovl: switch to mounter creds in readdir
In preparation for more permission checking, override credentials for
directory operations on the underlying filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Miklos Szeredi
130fdbc3d1 ovl: pass correct flags for opening real directory
The three instances of ovl_path_open() in overlayfs/readdir.c do three
different things:

 - pass f_flags from overlay file
 - pass O_RDONLY | O_DIRECTORY
 - pass just O_RDONLY

The value of f_flags can be (other than O_RDONLY):

O_WRONLY	- not possible for a directory
O_RDWR		- not possible for a directory
O_CREAT		- masked out by dentry_open()
O_EXCL		- masked out by dentry_open()
O_NOCTTY	- masked out by dentry_open()
O_TRUNC		- masked out by dentry_open()
O_APPEND	- no effect on directory ops
O_NDELAY	- no effect on directory ops
O_NONBLOCK	- no effect on directory ops
__O_SYNC	- no effect on directory ops
O_DSYNC		- no effect on directory ops
FASYNC		- no effect on directory ops
O_DIRECT	- no effect on directory ops
O_LARGEFILE	- ?
O_DIRECTORY	- only affects lookup
O_NOFOLLOW	- only affects lookup
O_NOATIME	- overlay sets this unconditionally in ovl_path_open()
O_CLOEXEC	- only affects fd allocation
O_PATH		- no effect on directory ops
__O_TMPFILE	- not possible for a directory


Fon non-merge directories we use the underlying filesystem's iterate; in
this case honor O_LARGEFILE from the original file to make sure that open
doesn't get rejected.

For merge directories it's safe to pass O_LARGEFILE unconditionally since
userspace will only see the artificial offsets created by overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
21d8d66abf ovl: fix redirect traversal on metacopy dentries
Amir pointed me to metacopy test cases in unionmount-testsuite and I
decided to run "./run --ov=10 --meta" and it failed while running test
"rename-mass-5.py".

Problem is w.r.t absolute redirect traversal on intermediate metacopy
dentry.  We do not store intermediate metacopy dentries and also skip
current loop/layer and move onto lookup in next layer.  But at the end of
loop, we have logic to reset "poe" and layer index if currnently looked up
dentry has absolute redirect.  We skip all that and that means lookup in
next layer will fail.

Following is simple test case to reproduce this.

- mkdir -p lower upper work merged lower/a lower/b
- touch lower/a/foo.txt
- mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged

# Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt.
- mv merged/a/foo.txt merged/b/bar.txt

# unmount overlay and use upper as lower layer (lower2) for next mount.
- umount merged
- mv upper lower2
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged

# Force a metacopy copy-up
- chown bin:bin merged/b/bar.txt

# unmount overlay and use upper as lower layer (lower3) for next mount.
- umount merged
- mv upper lower3
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged

# ls merged/b/bar.txt
ls: cannot access 'bar.txt': Input/output error

Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with
absolute redirect "/a/foo.txt".  We skipped redirect processing at the end
of loop which sets poe to roe and sets the appropriate next lower layer
index.  And that means lookup failed in next layer.

Fix this by continuing the loop for any intermediate dentries.  We still do
not save these at lower stack.  With this fix applied unionmount-testsuite,
"./run --ov-10 --meta" now passes.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
28166ab3c8 ovl: initialize OVL_UPPERDATA in ovl_lookup()
Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it
has to call ovl_check_metacopy_xattr() and check if metacopy xattr is
present or not.

yangerkun reported sometimes underlying filesystem might return -EIO and in
that case error handling path does not cleanup properly leading to various
warnings.

Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug
as below(linux 4.19):

[  551.001349] overlayfs: failed to get metacopy (-5)
[  551.003464] overlayfs: failed to get inode (-5)
[  551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5)
[  551.004941] overlayfs: failed to get origin (-5)
[  551.005199] ------------[ cut here ]------------
[  551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400
...
[  551.027219] Call Trace:
[  551.027623]  ovl_create_object+0x13f/0x170
[  551.028268]  ovl_create+0x27/0x30
[  551.028799]  path_openat+0x1a35/0x1ea0
[  551.029377]  do_filp_open+0xad/0x160
[  551.029944]  ? vfs_writev+0xe9/0x170
[  551.030499]  ? page_counter_try_charge+0x77/0x120
[  551.031245]  ? __alloc_fd+0x160/0x2a0
[  551.031832]  ? do_sys_open+0x189/0x340
[  551.032417]  ? get_unused_fd_flags+0x34/0x40
[  551.033081]  do_sys_open+0x189/0x340
[  551.033632]  __x64_sys_creat+0x24/0x30
[  551.034219]  do_syscall_64+0xd5/0x430
[  551.034800]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

One solution is to improve error handling and call iget_failed() if error
is encountered.  Amir thinks that this path is little intricate and there
is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode().
Instead caller of ovl_get_inode() can initialize this state.  And this will
avoid double checking of metacopy xattr lookup in ovl_lookup() and
ovl_get_inode().

OVL_UPPERDATA is inode flag.  So I was little concerned that initializing
it outside ovl_get_inode() might have some races.  But this is one way
transition.  That is once a file has been fully copied up, it can't go back
to metacopy file again.  And that seems to help avoid races.  So as of now
I can't see any races w.r.t OVL_UPPERDATA being set wrongly.  So move
settingof OVL_UPPERDATA inside the callers of ovl_get_inode().
ovl_obtain_alias() already does it.  So only two callers now left are
ovl_lookup() and ovl_instantiate().

Reported-by: yangerkun <yangerkun@huawei.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
6815f479ca ovl: use only uppermetacopy state in ovl_lookup()
Currently we use a variable "metacopy" which signifies that dentry could be
either uppermetacopy or lowermetacopy.  Amir suggested that we can move
code around and use d.metacopy in such a way that we don't need
lowermetacopy and just can do away with uppermetacopy.

So this patch replaces "metacopy" with "uppermetacopy".

It also moves some code little higher to keep reading little simpler.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
59fb20138a ovl: simplify setting of origin for index lookup
overlayfs can keep index of copied up files and directories and it seems to
serve two primary puroposes.  For regular files, it avoids breaking lower
hardlinks over copy up.  For directories it seems to be used for various
error checks.

During ovl_lookup(), we lookup for index using lower dentry in many a
cases.  That lower dentry is called "origin" and following is a summary of
current logic.

If there is no upperdentry, always lookup for index using lower dentry.
For regular files it helps avoiding breaking hard links over copyup and for
directories it seems to be just error checks.

If there is an upperdentry, then there are 3 possible cases.

 - For directories, lower dentry is found using two ways.  One is regular
  path based lookup in lower layers and second is using ORIGIN xattr on
  upper dentry.  First verify that path based lookup lower dentry matches
  the one pointed by upper ORIGIN xattr.  If yes, use this verified origin
  for index lookup.

 - For regular files (non-metacopy), there is no path based lookup in lower
  layers as lookup stops once we find upper dentry.  So there is no origin
  verification.  If there is ORIGIN xattr present on upper, use that to
  lookup index otherwise don't.

 - For regular metacopy files, again lower dentry is found using path based
  lookup as well as ORIGIN xattr on upper.  Path based lookup is continued
  in this case to find lower data dentry for metacopy upper.  So like
  directories we only use verified origin.  If ORIGIN xattr is not present
  (Either because lower did not support file handles or because this is
  hardlink copied up with index=off), then don't use path lookup based
  lower dentry as origin.  This is same as regular non-metacopy file case.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Amir Goldstein
522f6e6cba ovl: fix out of bounds access warning in ovl_check_fb_len()
syzbot reported out of bounds memory access from open_by_handle_at()
with a crafted file handle that looks like this:

  { .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }

handle_bytes gets rounded down to 0 and we end up calling:
  ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)

But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
fh + 3 is illegal.

Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Linus Torvalds
355ba37d75 Power management updates for 5.8-rc1
- Rework the system-wide PM driver flags to make them easier to
    understand and use and update their documentation (Rafael Wysocki,
    Alan Stern).
 
  - Allow cpuidle governors to be switched at run time regardless of
    the kernel configuration and update the related documentation
    accordingly (Hanjun Guo).
 
  - Improve the resume device handling in the user space hibernarion
    interface code (Domenico Andreoli).
 
  - Document the intel-speed-select sysfs interface (Srinivas
    Pandruvada).
 
  - Make the ACPI code handing suspend to idle print more debug
    messages to help diagnose issues with it (Rafael Wysocki).
 
  - Fix a helper routine in the cpufreq core and correct a typo in
    the struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
    Wenhu).
 
  - Update cpufreq drivers:
 
    * Make the intel_pstate driver start in the passive mode by
      default on systems without HWP (Rafael Wysocki).
 
    * Add i.MX7ULP support to the imx-cpufreq-dt driver and add
      i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).
 
    * Convert the qoriq cpufreq driver to a platform one, make the
      platform code create a suitable device object for it and add
      platform dependencies to it (Mian Yousaf Kaukab, Geert
      Uytterhoeven).
 
    * Fix wrong compatible binding in the qcom driver (Ansuel Smith).
 
    * Build the omap driver by default for ARCH_OMAP2PLUS (Anders
      Roxell).
 
    * Add r8a7742 SoC support to the dt cpufreq driver (Lad Prabhakar).
 
  - Update cpuidle core and drivers:
 
    * Fix three reference count leaks in error code paths in the
      cpuidle core (Qiushi Wu).
 
    * Convert Qualcomm SPM to a generic cpuidle driver (Stephan
      Gerhold).
 
    * Fix up the execution order when entering a domain idle state in
      the PSCI driver (Ulf Hansson).
 
  - Fix a reference counting issue related to clock management and
    clean up two oddities in the PM-runtime framework (Rafael Wysocki,
    Andy Shevchenko).
 
  - Add ElkhartLake support to the Intel RAPL power capping driver
    and remove an unused local MSR definition from it (Jacob Pan,
    Sumeet Pawnikar).
 
  - Update devfreq core and drivers:
 
    * Replace strncpy() with strscpy() in the devfreq core and use
      lockdep asserts instead of manual checks for a locked mutex in
      it (Dmitry Osipenko, Krzysztof Kozlowski).
 
    * Add a generic imx bus scaling driver and make it register an
      interconnect device (Leonard Crestez, Gustavo A. R. Silva).
 
    * Make the cpufreq notifier in the tegra30 driver take boosting
      into account and delete an unuseful error message from that
      driver (Dmitry Osipenko, Markus Elfring).
 
  - Remove unneeded semicolon from the cpupower code (Zou Wei).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VGjwSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx46gP/jGAXlddFEQswi6qUT3Cff0A9mb8CdcX
 dyKrjX4xxo/wtBIAwSN4achxrgse//ayo2dYTzWRDd31W9Azbv+5F+46XsDRz4hL
 pH29u/E66NMtFWnHCmt78NEJn0FzSa0YBC43ZzwFwKktCK9skYIpGN2z6iuXUBSX
 Q5GHqop3zvDsdKQFBGL62xvUw/AmOTPG7ohIZvqWBN2mbOqEqMcoFHT+aUF/NbLj
 +i14dvTH767eDZGRVASmXWQyljjaRWm+SIw4+m8zT1D1Y3d5IFObuMN+9RQl1Tif
 BYjkgJ2oDDMhCJLW7TBuJB+g7exiyaSQds3nMr2ZR+eZbJipICjU4eehNEKIUopU
 DM17tHQfnwZfS/7YbCx3vYQwLkNq37AJyXS9uqCAIFM+0n4xN4/mIVmgWYISLDTs
 1v9olFxtwMRNpjGGQWPJAO7ebB8Zz9qhQv7pIkSQEfwp93/SzvlVf4vvruTeFN9J
 qqG60cDumXWAm+s43eQHJNn5nOd5ocWv0FBpo/cxqKbzxFVWwdB42Cm0SY+rK2ID
 uHdnc2DJcK2c78UVbz3Cmk4272foJt2zxchqjFXXAZPLrOsFfzmti4B28VxGxjmP
 LG3MhH5sdbF4yl/1aSC1Bnrt+PV9Lus6ut/VKhjwIpw8cqiXgpwSbMoDoaBd9UMQ
 ubGz2rplGAtB
 =APdj
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These rework the system-wide PM driver flags, make runtime switching
  of cpuidle governors easier, improve the user space hibernation
  interface code, add intel-speed-select interface documentation, add
  more debug messages to the ACPI code handling suspend to idle, update
  the cpufreq core and drivers, fix a minor issue in the cpuidle core
  and update two cpuidle drivers, improve the PM-runtime framework,
  update the Intel RAPL power capping driver, update devfreq core and
  drivers, and clean up the cpupower utility.

  Specifics:

   - Rework the system-wide PM driver flags to make them easier to
     understand and use and update their documentation (Rafael Wysocki,
     Alan Stern).

   - Allow cpuidle governors to be switched at run time regardless of
     the kernel configuration and update the related documentation
     accordingly (Hanjun Guo).

   - Improve the resume device handling in the user space hibernarion
     interface code (Domenico Andreoli).

   - Document the intel-speed-select sysfs interface (Srinivas
     Pandruvada).

   - Make the ACPI code handing suspend to idle print more debug
     messages to help diagnose issues with it (Rafael Wysocki).

   - Fix a helper routine in the cpufreq core and correct a typo in the
     struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
     Wenhu).

   - Update cpufreq drivers:

      - Make the intel_pstate driver start in the passive mode by
        default on systems without HWP (Rafael Wysocki).

      - Add i.MX7ULP support to the imx-cpufreq-dt driver and add
        i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).

      - Convert the qoriq cpufreq driver to a platform one, make the
        platform code create a suitable device object for it and add
        platform dependencies to it (Mian Yousaf Kaukab, Geert
        Uytterhoeven).

      - Fix wrong compatible binding in the qcom driver (Ansuel Smith).

      - Build the omap driver by default for ARCH_OMAP2PLUS (Anders
        Roxell).

      - Add r8a7742 SoC support to the dt cpufreq driver (Lad
        Prabhakar).

   - Update cpuidle core and drivers:

      - Fix three reference count leaks in error code paths in the
        cpuidle core (Qiushi Wu).

      - Convert Qualcomm SPM to a generic cpuidle driver (Stephan
        Gerhold).

      - Fix up the execution order when entering a domain idle state in
        the PSCI driver (Ulf Hansson).

   - Fix a reference counting issue related to clock management and
     clean up two oddities in the PM-runtime framework (Rafael Wysocki,
     Andy Shevchenko).

   - Add ElkhartLake support to the Intel RAPL power capping driver and
     remove an unused local MSR definition from it (Jacob Pan, Sumeet
     Pawnikar).

   - Update devfreq core and drivers:

      - Replace strncpy() with strscpy() in the devfreq core and use
        lockdep asserts instead of manual checks for a locked mutex in
        it (Dmitry Osipenko, Krzysztof Kozlowski).

      - Add a generic imx bus scaling driver and make it register an
        interconnect device (Leonard Crestez, Gustavo A. R. Silva).

      - Make the cpufreq notifier in the tegra30 driver take boosting
        into account and delete an unuseful error message from that
        driver (Dmitry Osipenko, Markus Elfring).

   - Remove unneeded semicolon from the cpupower code (Zou Wei)"

* tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (51 commits)
  cpuidle: Fix three reference count leaks
  PM: runtime: Replace pm_runtime_callbacks_present()
  PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex
  PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR
  PM / devfreq: Replace strncpy with strscpy
  PM / devfreq: imx: Register interconnect device
  PM / devfreq: Add generic imx bus scaling driver
  PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe()
  PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting
  PM: hibernate: Restrict writes to the resume device
  PM: runtime: clk: Fix clk_pm_runtime_get() error path
  cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver
  ACPI: EC: PM: s2idle: Extend GPE dispatching debug message
  ACPI: PM: s2idle: Print type of wakeup debug messages
  powercap: RAPL: remove unused local MSR define
  PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()
  Documentation: admin-guide: pm: Document intel-speed-select
  PM: hibernate: Split off snapshot dev option
  PM: hibernate: Incorporate concurrency handling
  Documentation: ABI: make current_governer_ro as a candidate for removal
  ...
2020-06-02 13:17:23 -07:00
Bob Peterson
1a0b00d15d gfs2: Only do glock put in gfs2_create_inode for free inodes
Before this patch, the error path of function gfs2_create_inode would
always calls gfs2_glock_put for the inode glock. That's good for inodes
that are free. But after they've been added to the vfs inodes, errors
will cause the inode to be evicted, and the evict will do the glock
put for us. If we do a glock put again, we can try to free the glock
while there are still references to it, e.g. revokes pending for
the transaction that created it.

This patch adds a check: if (free_vfs_inode) before the put, thus
solving the problem.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 21:23:55 +02:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Christoph Hellwig
88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
d4efd79a81 mm: remove the prot argument from vm_map_ram
This is always PAGE_KERNEL - for long term mappings with other properties
vmap should be used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-19-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Huang Ying
c94b6923fa /proc/PID/smaps: Add PMD migration entry parsing
Now, when reading /proc/PID/smaps, the PMD migration entry in page table
is simply ignored.  To improve the accuracy of /proc/PID/smaps, its
parsing and processing is added.

To test the patch, we run pmbench to eat 400 MB memory in background,
then run /usr/bin/migratepages and `cat /proc/PID/smaps` every second.
The issue as follows can be reproduced within 60 seconds.

Before the patch, for the fully populated 400 MB anonymous VMA, some THP
pages under migration may be lost as below.

  7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
  Size:             409600 kB
  KernelPageSize:        4 kB
  MMUPageSize:           4 kB
  Rss:              407552 kB
  Pss:              407552 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:    407552 kB
  Referenced:       301056 kB
  Anonymous:        407552 kB
  LazyFree:              0 kB
  AnonHugePages:    405504 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:        0 kB
  Private_Hugetlb:       0 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB
  THPeligible:		1
  VmFlags: rd wr mr mw me ac

After the patch, it will be always,

  7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
  Size:             409600 kB
  KernelPageSize:        4 kB
  MMUPageSize:           4 kB
  Rss:              409600 kB
  Pss:              409600 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:    409600 kB
  Referenced:       294912 kB
  Anonymous:        409600 kB
  LazyFree:              0 kB
  AnonHugePages:    407552 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:        0 kB
  Private_Hugetlb:       0 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB
  THPeligible:		1
  VmFlags: rd wr mr mw me ac

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200403123059.1846960-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:10 -07:00
NeilBrown
8d92890bd6 mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a8
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.  With this
patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points and warning printks which mentioned this
counter no longer report it.

[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
NeilBrown
a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Guoqing Jiang
4c42be38c2 orangefs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in orangefs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Link: http://lkml.kernel.org/r/20200517214718.468-9-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Guoqing Jiang
14ed109e3f ntfs: replace attach_page_buffers with attach_page_private
Call the new function since attach_page_buffers will be removed.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Link: http://lkml.kernel.org/r/20200517214718.468-8-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
58aeb73196 iomap: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in iomap.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
7128cf9a25 f2fs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in f2fs.h.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Link: http://lkml.kernel.org/r/20200517214718.468-6-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
45dcfc2732 fs/buffer.c: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in buffer.c.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200517214718.468-5-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
d1b89bc042 btrfs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in btrfs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
9d24a13a93 iomap: convert from readpages to readahead
Use the new readahead operation in iomap.  Convert XFS and ZoneFS to use
it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
76a0294eb1 fuse: convert from readpages to readahead
Implement the new readahead operation in fuse by using __readahead_batch()
to fill the array of pages in fuse_args_pages directly.  This lets us
inline fuse_readpages_fill() into fuse_readahead().

[willy@infradead.org: build fix]
  Link: http://lkml.kernel.org/r/20200415025938.GB5820@bombadil.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-25-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
e20a769364 f2fs: pass the inode to f2fs_mpage_readpages
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-24-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2332319625 f2fs: convert from readpages to readahead
Use the new readahead operation in f2fs

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-23-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
a07f624bd6 ext4: pass the inode to ext4_mpage_readpages
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-22-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
6311f91f76 ext4: convert from readpages to readahead
Use the new readahead operation in ext4

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-21-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
0615090c50 erofs: convert compressed files from readpages to readahead
Use the new readahead operation in erofs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-20-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
0c07a9f91e erofs: convert uncompressed files from readpages to readahead
Use the new readahead operation in erofs

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-19-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
ba206a026f btrfs: convert from readpages to readahead
Implement the new readahead method in btrfs using the new
readahead_page_batch() function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
d4388340ae fs: convert mpage_readpages to mpage_readahead
Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).

The callers are all trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2c684234d3 mm: add page_cache_readahead_unbounded
ext4 and f2fs have duplicated the guts of the readahead code so they can
read past i_size.  Instead, separate out the guts of the readahead code
so they can call it directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Jeff Layton
485e9605c0 fs/buffer.c: record blockdev write errors in super_block that it backs
When syncing out a block device (a'la __sync_blockdev), any error
encountered will only be recorded in the bd_inode's mapping.  When the
blockdev contains a filesystem however, we'd like to also record the
error in the super_block that's stored there.

Make mark_buffer_write_io_error also record the error in the
corresponding super_block when a writeback error occurs and the block
device contains a mounted superblock.

Since superblocks are RCU freed, hold the rcu_read_lock to ensure that
the superblock doesn't go away while we're marking it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Jeff Layton
735e4ae5ba vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back.  It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect.  If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error.  syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures.  Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor.  This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though.  This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Gang He
912f655d78 ocfs2: mount shared volume without ha stack
Usually we create and use a ocfs2 shared volume on the top of ha stack.
For pcmk based ha stack, which includes DLM, corosync and pacemaker
services.

The customers complained they could not mount existent ocfs2 volume in
the single node without ha stack, e.g.  single node backup/restore
scenario.

Like this case, the customers just want to access the data from the
existent ocfs2 volume quickly, but do not want to restart or setup ha
stack.

Then, I'd like to add a mount option "nocluster", if the users use this
option to mount a ocfs2 shared volume, the whole mount will not depend
on the ha related services.  the command will mount the existent ocfs2
volume directly (like local mount), for avoiding setup the ha stack.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200423053300.22661-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Jules Irenge
8f745e62a1 ocfs2: add missing annotation for dlm_empty_lockres()
Sparse reports a warning at dlm_empty_lockres()

  warning: context imbalance in dlm_purge_lockres() - unexpected unlock

The root cause is the missing annotation at dlm_purge_lockres()

Add the missing __must_hold(&dlm->spinlock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200403160505.2832-4-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Philippe Liard
93e72b3c61 squashfs: migrate from ll_rw_block usage to BIO
ll_rw_block() function has been deprecated in favor of BIO which appears
to come with large performance improvements.

This patch decreases boot time by close to 40% when using squashfs for
the root file-system.  This is observed at least in the context of
starting an Android VM on Chrome OS using crosvm.  The patch was tested
on 4.19 as well as master.

This patch is largely based on Adrien Schildknecht's patch that was
originally sent as https://lkml.org/lkml/2017/9/22/814 though with some
significant changes and simplifications while also taking Phillip
Lougher's feedback into account, around preserving support for
FILE_CACHE in particular.

[akpm@linux-foundation.org: fix build error reported by Randy]
  Link: http://lkml.kernel.org/r/319997c2-5fc8-f889-2ea3-d913308a7c1f@infradead.org
Signed-off-by: Philippe Liard <pliard@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: https://chromium.googlesource.com/chromiumos/platform/crosvm
Link: http://lkml.kernel.org/r/20191106074238.186023-1-pliard@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Bob Peterson
ea22eee4e6 gfs2: Allow lock_nolock mount to specify jid=X
Before this patch, a simple typo accidentally added \n to the jid=
string for lock_nolock mounts. This made it impossible to mount a
gfs2 file system with a journal other than journal0. Thus:

mount -tgfs2 -o hostdata="jid=1" <device> <mount pt>

Resulted in:
mount: wrong fs type, bad option, bad superblock on <device>

In most cases this is not a problem. However, for debugging and
testing purposes we sometimes want to test the integrity of other
journals. This patch removes the unnecessary \n and thus allows
lock_nolock users to specify an alternate journal.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 19:45:05 +02:00
Bob Peterson
bbae10fac2 gfs2: Don't ignore inode write errors during inode_go_sync
Before for this patch, function inode_go_sync ignored io errors
during inode_go_sync, overwriting them with metadata write errors:

		error = filemap_fdatawait(mapping);
		mapping_set_error(mapping, error);
	}
	error = filemap_fdatawait(metamapping);
	...
	return error;

So any errors returned by the inode write would be forgotten if the
metadata write succeeded. This patch still does both writes, but
only sets error if it's still zero. That way, any errors will be
reported by to the caller, do_xmote, which will take appropriate
action and report the error.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 19:45:05 +02:00
Aurelien Aptel
5f68ea4aa9 cifs: multichannel: move channel selection in function
This commit moves channel picking code in separate function.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-02 09:58:41 -05:00