Commit Graph

58702 Commits

Author SHA1 Message Date
Linus Torvalds
700a800a94 This pull consists mostly of nfsd container work:
Scott Mayhew revived an old api that communicates with a userspace
 daemon to manage some on-disk state that's used to track clients across
 server reboots.  We've been using a usermode_helper upcall for that, but
 it's tough to run those with the right namespaces, so a daemon is much
 friendlier to container use cases.
 
 Trond fixed nfsd's handling of user credentials in user namespaces.  He
 also contributed patches that allow containers to support different sets
 of NFS protocol versions.
 
 The only remaining container bug I'm aware of is that the NFS reply
 cache is shared between all containers.  If anyone's aware of other gaps
 in our container support, let me know.
 
 The rest of this is miscellaneous bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAlzcWNcVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+DUEP/0WD3jKNAHFV3M5YQPAI9fz/iCND
 Db/A4oWP5qa6JmwmHe61il29QeGqkeFr/NPexgzM3Xw2E39d7RBXBeWyVDuqb0wr
 6SCXjXibTsuAHg11nR8Xf0P5Vej3rfGbG6up5lLCIDTEZxVpWoaBJnM8+3bewuCj
 XbeiDW54oiMbmDjon3MXqVAIF/z7LjorecJ+Yw5+0Jy7KZ6num9Kt8+fi7qkEfFd
 i5Bp9KWgzlTbJUJV4EX3ZKN3zlGkfOvjoo2kP3PODPVMB34W8jSLKkRSA1tDWYZg
 43WhBt5OODDlV6zpxSJXehYKIB4Ae469+RRaIL4F+ORRK+AzR0C/GTuOwJiG+P3J
 n95DX5WzX74nPOGQJgAvq4JNpZci85jM3jEK1TR2M7KiBDG5Zg+FTsPYVxx5Sgah
 Akl/pjLtHQPSdBbFGHn5TsXU+gqWNiKsKa9663tjxLb8ldmJun6JoQGkAEF9UJUn
 dzv0UxyHeHAblhSynY+WsUR+Xep9JDo/p5LyFK4if9Sd62KeA1uF/MFhAqpKZF81
 mrgRCqW4sD8aVTBNZI06pZzmcZx4TRr2o+Oj5KAXf6Yk6TJRSGfnQscoMMBsTLkw
 VK1rBQ/71TpjLHGZZZEx1YJrkVZAMmw2ty4DtK2f9jeKO13bWmUpc6UATzVufHKA
 C1rUZXJ5YioDbYDy
 =TUdw
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "This consists mostly of nfsd container work:

  Scott Mayhew revived an old api that communicates with a userspace
  daemon to manage some on-disk state that's used to track clients
  across server reboots. We've been using a usermode_helper upcall for
  that, but it's tough to run those with the right namespaces, so a
  daemon is much friendlier to container use cases.

  Trond fixed nfsd's handling of user credentials in user namespaces. He
  also contributed patches that allow containers to support different
  sets of NFS protocol versions.

  The only remaining container bug I'm aware of is that the NFS reply
  cache is shared between all containers. If anyone's aware of other
  gaps in our container support, let me know.

  The rest of this is miscellaneous bugfixes"

* tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
  nfsd: update callback done processing
  locks: move checks from locks_free_lock() to locks_release_private()
  nfsd: fh_drop_write in nfsd_unlink
  nfsd: allow fh_want_write to be called twice
  nfsd: knfsd must use the container user namespace
  SUNRPC: rsi_parse() should use the current user namespace
  SUNRPC: Fix the server AUTH_UNIX userspace mappings
  lockd: Pass the user cred from knfsd when starting the lockd server
  SUNRPC: Temporary sockets should inherit the cred from their parent
  SUNRPC: Cache the process user cred in the RPC server listener
  nfsd: Allow containers to set supported nfs versions
  nfsd: Add custom rpcbind callbacks for knfsd
  SUNRPC: Allow further customisation of RPC program registration
  SUNRPC: Clean up generic dispatcher code
  SUNRPC: Add a callback to initialise server requests
  SUNRPC/nfs: Fix return value for nfs4_callback_compound()
  nfsd: handle legacy client tracking records sent by nfsdcld
  nfsd: re-order client tracking method selection
  nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
  nfsd: un-deprecate nfsdcld
  ...
2019-05-15 18:21:43 -07:00
Sabyasachi Gupta
3813393f5a fs/block_dev.c: Remove duplicate header
linux/dax.h is included more than once.

Link: http://lkml.kernel.org/r/5c867e95.1c69fb81.4f15a.e5e4@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Sabyasachi Gupta
081d7d35fb fs/cachefiles/namei.c: remove duplicate header
linux/xattr.h is included more than once.

Link: http://lkml.kernel.org/r/5c86803d.1c69fb81.1a7c6.2b78@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
Sabyasachi Gupta
10bcba8c16 fs/coda/psdev.c: remove duplicate header
linux/poll.h is included more than once.

Link: http://lkml.kernel.org/r/5c86820f.1c69fb81.149f0.0834@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:52 -07:00
YueHaibing
ce528c4c20 fs/eventfd.c: make eventfd_ida static
Fix sparse warning:

fs/eventfd.c:26:1: warning:
 symbol 'eventfd_ida' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190413142348.34716-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Masatake YAMATO
b556db17b0 eventfd: present id to userspace via fdinfo
Finding endpoints of an IPC channel is one of essential task to
understand how a user program works.  Procfs and netlink socket provide
enough hints to find endpoints for IPC channels like pipes, unix
sockets, and pseudo terminals.  However, there is no simple way to find
endpoints for an eventfd file from userland.  An inode number doesn't
hint.  Unlike pipe, all eventfd files share the same inode object.

To provide the way to find endpoints of an eventfd file, this patch adds
"eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier.
Integers managed by an IDA are used as ids.

A tool like lsof can utilize the information to print endpoints.

Link: http://lkml.kernel.org/r/20190327181823.20222-1-yamato@redhat.com
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Alexey Dobriyan
d53ddd0181 fs/exec.c: move ->recursion_depth out of critical sections
->recursion_depth is changed only by current, therefore decrementing can
be done without taking any locks.

Link: http://lkml.kernel.org/r/20190417213150.GA26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Hou Tao
bd8309de0d fs/fat/file.c: issue flush after the writeback of FAT
fsync() needs to make sure the data & meta-data of file are persistent
after the return of fsync(), even when a power-failure occurs later.  In
the case of fat-fs, the FAT belongs to the meta-data of file, so we need
to issue a flush after the writeback of FAT instead before.

Also bail out early when any stage of fsync fails.

Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Bharath Vedartham
672cdd56f0 reiserfs: add comment to explain endianness issue in xattr_hash
csum_partial() gives different results for little-endian and big-endian
hosts.  This causes images created on little-endian hosts and mounted on
big endian hosts to see csum mismatches.  This causes an endianness bug.
Sparse gives a warning as csum_partial returns a restricted integer type
__wsum_t and xattr_hash expects __u32.  This warning acts as a reminder
for this bug and should not be suppressed.

This comment aims to convey these endianness issues.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190423161831.GA15387@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Kees Cook
bbdc6076d2 binfmt_elf: move brk out of mmap when doing direct loader exec
Commmit eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
made changes in the rare case when the ELF loader was directly invoked
(e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
the loader), by moving into the mmap region to avoid both ET_EXEC and
PIE binaries.  This had the effect of also moving the brk region into
mmap, which could lead to the stack and brk being arbitrarily close to
each other.  An unlucky process wouldn't get its requested stack size
and stack allocations could end up scribbling on the heap.

This is illustrated here.  In the case of using the loader directly, brk
(so helpfully identified as "[heap]") is allocated with the _loader_ not
the binary.  For example, with ASLR entirely disabled, you can see this
more clearly:

$ /bin/cat /proc/self/maps
555555554000-55555555c000 r-xp 00000000 ... /bin/cat
55555575b000-55555575c000 r--p 00007000 ... /bin/cat
55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
55555575d000-55555577e000 rw-p 00000000 ... [heap]
...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

$ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
...
7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
nothing is there in the direct loader case (and ET_EXEC is still far
away at 0x400000).  Anything that ran before should still work (i.e.
the ultimately-launched binary already had the brk very far from its
text, so this should be no different from a COMPAT_BRK standpoint).  The
only risk I see here is that if someone started to suddenly depend on
the entire memory space lower than the mmap region being available when
launching binaries via a direct loader execs which seems highly
unlikely, I'd hope: this would mean a binary would _not_ work when
exec()ed normally.

(Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION
when randomization is turned on.)

Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast
Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com
Fixes: eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ali Saidi <alisaidi@amazon.com>
Cc: Ali Saidi <alisaidi@amazon.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
249b08e4e5 elf: init pt_regs pointer later
Get "current_pt_regs" pointer right before usage.

Space savings on x86_64:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-180 (-180)
	Function                           old     new   delta
	load_elf_binary                   5806    5626    -180 !!!

Looks like the compiler doesn't know that "current_pt_regs" is stable
pointer (because it doesn't know ->stack isn't) even though it knows
that "current" is stable pointer.  So it saves it in the very beginning
and then tries to carry it through a lot of code.

Here is what happens here:

load_elf_binary()
		...
	mov	rax,QWORD PTR gs:0x14c00
	mov	r13,QWORD PTR [rax+0x18]	r13 = current->stack
	call	kmem_cache_alloc		# first kmalloc

		[980 bytes later!]

	# let's spill that sucker because we need a register
	# for "load_bias" calculations at
	#
	#	if (interpreter) {
	#		load_bias = ELF_ET_DYN_BASE;
	#		if (current->flags & PF_RANDOMIZE)
	#			load_bias += arch_mmap_rnd();
	#		elf_flags |= elf_fixed;
	#	}
	mov	QWORD PTR [rsp+0x68],r13

If this is not _the_ root cause it is still eeeeh.

After the patch things become much simpler:

	mov	rax, QWORD PTR gs:0x14c00	# current
	mov	rdx, QWORD PTR [rax+0x18]	# current->stack
	movq	[rdx+0x3fb8], 0			# fill pt_regs
		...
	call finalize_exec

Link: http://lkml.kernel.org/r/20190419200343.GA19788@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
d8e7cb39ac fs/binfmt_elf.c: extract PROT_* calculations
There are two places where mapping protections are calculated: one for
executable, another one for interpreter -- take them out.

ELF read and execute permissions are interchanged with Linux PROT_READ
and PROT_EXEC, microoptimizations are welcome!

Link: http://lkml.kernel.org/r/20190417213413.GB26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
852643165a fs//binfmt_elf.c: move variables initialization closer to their usage
Link: http://lkml.kernel.org/r/20190416202002.GB24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
be0deb585e fs/binfmt_elf.c: save 1 indent level
Rewrite

	for (...) {
		if (->p_type == PT_INTERP) {
			...
			break;
		}
	}

loop into

	for (...) {
		if (->p_type != PT_INTERP)
			continue;
		...
		break;
	}

Link: http://lkml.kernel.org/r/20190416201906.GA24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
ba0f6b88a8 fs/binfmt_elf.c: delete trailing "return;" in functions returning "void"
Link: http://lkml.kernel.org/r/20190314205042.GE18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
cc338010a2 fs/binfmt_elf.c: free PT_INTERP filename ASAP
There is no reason for PT_INTERP filename to linger till the end of the
whole loading process.

Link: http://lkml.kernel.org/r/20190314204953.GD18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mukesh Ojha <mojha@codeaurora.org>
[nikitas.angelinas@gmail.com: fix GPF when dereferencing invalid interpreter]
  Link: http://lkml.kernel.org/r/20190330140032.GA1527@vostro
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Alexey Dobriyan
5cf4a36382 fs/binfmt_elf.c: make scope of "pos" variable smaller
Link: http://lkml.kernel.org/r/20190314204707.GC18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Andrew Morton
22f084dbc1 fs/binfmt_elf.c: remove unneeded initialization of mm->start_stack
As pointed out by zoujc@lenovo.com, setup_arg_pages() already
initialized current->mm->start_stack.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202881
Reported-by: <zoujc@lenovo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Lin Feng
e02c9b0d65 kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing
The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.

Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Linus Torvalds
318222a35b Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes

 - ocfs2

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...
2019-05-14 10:10:55 -07:00
Mike Kravetz
f27a5136f7 hugetlbfs: always use address space in inode for resv_map pointer
Continuing discussion about 58b6e5e8f1 ("hugetlbfs: fix memory leak for
resv_map") brought up the issue that inode->i_mapping may not point to the
address space embedded within the inode at inode eviction time.  The
hugetlbfs truncate routine handles this by explicitly using inode->i_data.
However, code cleaning up the resv_map will still use the address space
pointed to by inode->i_mapping.  Luckily, private_data is NULL for address
spaces in all such cases today but, there is no guarantee this will
continue.

Change all hugetlbfs code getting a resv_map pointer to explicitly get it
from the address space embedded within the inode.  In addition, add more
comments in the code to indicate why this is being done.

Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Yufen Yu <yuyufen@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Amir Goldstein
c553ea4fdf fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
23d0127096 ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true.  It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the
reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
requested to wait for in-flight IO as well as to writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
WB_SYNC_ALL writeback to perform the full range sync request.

Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com
Fixes: 23d0127096 ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Jérôme Glisse
7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Mike Kravetz
1b426bac66 hugetlb: use same fault hash key for shared and private mappings
hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared keys off address_space and file index.  Private keys
off mm and virtual address.  Consider a private mappings of a populated
hugetlbfs file.  A fault will map the page from the file and if needed
do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could race with this code to map
the file page.  This causes problems (BUG) for the page cache remove
code as it expects the page to be unmapped.  A sample stack is:

page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
kernel BUG at mm/filemap.c:169!
...
RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
...
Call Trace:
__delete_from_page_cache+0x39/0x220
delete_from_page_cache+0x45/0x70
remove_inode_hugepages+0x13c/0x380
? __add_to_page_cache_locked+0x162/0x380
hugetlbfs_fallocate+0x403/0x540
? _cond_resched+0x15/0x30
? __inode_security_revalidate+0x5d/0x70
? selinux_file_permission+0x100/0x130
vfs_fallocate+0x13f/0x270
ksys_fallocate+0x3c/0x80
__x64_sys_fallocate+0x1a/0x20
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914eb
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
Fixes: b5cec28d36 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:48 -07:00
Aneesh Kumar K.V
024eee0e83 mm: page_mkclean vs MADV_DONTNEED race
MADV_DONTNEED is handled with mmap_sem taken in read mode.  We call
page_mkclean without holding mmap_sem.

MADV_DONTNEED implies that pages in the region are unmapped and subsequent
access to the pages in that range is handled as a new page fault.  This
implies that if we don't have parallel access to the region when
MADV_DONTNEED is run we expect those range to be unallocated.

w.r.t page_mkclean() we need to make sure that we don't break the
MADV_DONTNEED semantics.  MADV_DONTNEED check for pmd_none without holding
pmd_lock.  This implies we skip the pmd if we temporarily mark pmd none.
Avoid doing that while marking the page clean.

Keep the sequence same for dax too even though we don't support
MADV_DONTNEED for dax mapping

The bug was noticed by code review and I didn't observe any failures w.r.t
test run.  This is similar to

commit 58ceeb6bec
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Apr 13 14:56:26 2017 -0700

    thp: fix MADV_DONTNEED vs. MADV_FREE race

commit ced108037c
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Thu Apr 13 14:56:20 2017 -0700

    thp: fix MADV_DONTNEED vs. numa balancing race

Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc:"Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:48 -07:00
Ira Weiny
73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
932f4a630a mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Peter Xu
cefdca0a86 userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available.  By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit.  While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.

We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.

Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users.  When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.

Andrea said:

: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd.  In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.

[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Shuning Zhang
e091eab028 ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
In some cases, ocfs2_iget() reads the data of inode, which has been
deleted for some reason.  That will make the system panic.  So We should
judge whether this inode has been deleted, and tell the caller that the
inode is a bad inode.

For example, the ocfs2 is used as the backed of nfs, and the client is
nfsv3.  This issue can be reproduced by the following steps.

on the nfs server side,
..../patha/pathb

Step 1: The process A was scheduled before calling the function fh_verify.

Step 2: The process B is removing the 'pathb', and just completed the call
to function dput.  Then the dentry of 'pathb' has been deleted from the
dcache, and all ancestors have been deleted also.  The relationship of
dentry and inode was deleted through the function hlist_del_init.  The
following is the call stack.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)

At this time, the inode is still in the dcache.

Step 3: The process A call the function ocfs2_get_dentry, which get the
inode from dcache.  Then the refcount of inode is 1.  The following is the
call stack.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)

Step 4: Dirty pages are flushed by bdi threads.  So the inode of 'patha'
is evicted, and this directory was deleted.  But the inode of 'pathb'
can't be evicted, because the refcount of the inode was 1.

Step 5: The process A keep running, and call the function
reconnect_path(in exportfs_decode_fh), which call function
ocfs2_get_parent of ocfs2.  Get the block number of parent
directory(patha) by the name of ...  Then read the data from disk by the
block number.  But this inode has been deleted, so the system panic.

Process A                                             Process B
1. in nfsd3_proc_getacl                   |
2.                                        |        dput
3. fh_to_dentry(ocfs2_get_dentry)         |
4. bdi flush dirty cache                  |
5. ocfs2_iget                             |

[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set

[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error

[283465.546889] CPU: 5 PID: 12416 Comm: nfsd Tainted: G        W
4.1.12-124.18.6.el6uek.bug28762940v3.x86_64 #2
[283465.548382] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 09/21/2015
[283465.549657]  0000000000000000 ffff8800a56fb7b8 ffffffff816e839c
ffffffffa0514758
[283465.550392]  000000000008dc20 ffff8800a56fb838 ffffffff816e62d3
0000000000000008
[283465.551056]  ffff880000000010 ffff8800a56fb848 ffff8800a56fb7e8
ffff88005df9f000
[283465.551710] Call Trace:
[283465.552516]  [<ffffffff816e839c>] dump_stack+0x63/0x81
[283465.553291]  [<ffffffff816e62d3>] panic+0xcb/0x21b
[283465.554037]  [<ffffffffa04e66b0>] ocfs2_handle_error+0xf0/0xf0 [ocfs2]
[283465.554882]  [<ffffffffa04e7737>] __ocfs2_error+0x67/0x70 [ocfs2]
[283465.555768]  [<ffffffffa049c0f9>] ocfs2_validate_inode_block+0x229/0x230
[ocfs2]
[283465.556683]  [<ffffffffa047bcbc>] ocfs2_read_blocks+0x46c/0x7b0 [ocfs2]
[283465.557408]  [<ffffffffa049bed0>] ? ocfs2_inode_cache_io_unlock+0x20/0x20
[ocfs2]
[283465.557973]  [<ffffffffa049f0eb>] ocfs2_read_inode_block_full+0x3b/0x60
[ocfs2]
[283465.558525]  [<ffffffffa049f5ba>] ocfs2_iget+0x4aa/0x880 [ocfs2]
[283465.559082]  [<ffffffffa049146e>] ocfs2_get_parent+0x9e/0x220 [ocfs2]
[283465.559622]  [<ffffffff81297c05>] reconnect_path+0xb5/0x300
[283465.560156]  [<ffffffff81297f46>] exportfs_decode_fh+0xf6/0x2b0
[283465.560708]  [<ffffffffa062faf0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[283465.561262]  [<ffffffff810a8196>] ? prepare_creds+0x26/0x110
[283465.561932]  [<ffffffffa0630860>] fh_verify+0x350/0x660 [nfsd]
[283465.562862]  [<ffffffffa0637804>] ? nfsd_cache_lookup+0x44/0x630 [nfsd]
[283465.563697]  [<ffffffffa063a8b9>] nfsd3_proc_getattr+0x69/0xf0 [nfsd]
[283465.564510]  [<ffffffffa062cf60>] nfsd_dispatch+0xe0/0x290 [nfsd]
[283465.565358]  [<ffffffffa05eb892>] ? svc_tcp_adjust_wspace+0x12/0x30
[sunrpc]
[283465.566272]  [<ffffffffa05ea652>] svc_process_common+0x412/0x6a0 [sunrpc]
[283465.567155]  [<ffffffffa05eaa03>] svc_process+0x123/0x210 [sunrpc]
[283465.568020]  [<ffffffffa062c90f>] nfsd+0xff/0x170 [nfsd]
[283465.568962]  [<ffffffffa062c810>] ? nfsd_destroy+0x80/0x80 [nfsd]
[283465.570112]  [<ffffffff810a622b>] kthread+0xcb/0xf0
[283465.571099]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
[283465.572114]  [<ffffffff816f11b8>] ret_from_fork+0x58/0x90
[283465.573156]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180

Link: http://lkml.kernel.org/r/1554185919-3010-1-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: piaojun <piaojun@huawei.com>
Cc: "Gang He" <ghe@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:44 -07:00
Phillip Potter
9dc2108d66 ocfs2: use common file type conversion
Deduplicate the ocfs2 file type conversion implementation and remove
OCFS2_FT_* definitions - file systems that use the same file types as
defined by POSIX do not need to define their own versions and can use the
common helper functions decared in fs_types.h and implemented in
fs_types.c

Common implementation can be found via bbe7449e25 ("fs: common
implementation of file type").

Link: http://lkml.kernel.org/r/20190326213919.GA20878@pathfinder
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:44 -07:00
Dan Williams
fce86ff580 mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses
Starting with c6f3c5ee40 ("mm/huge_memory.c: fix modifying of page
protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
pmdp_set_access_flags().  That helper enforces a pmd aligned @address
argument via VM_BUG_ON() assertion.

Update the implementation to take a 'struct vm_fault' argument directly
and apply the address alignment fixup internally to fix crash signatures
like:

    kernel BUG at arch/x86/mm/pgtable.c:515!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 51 PID: 43713 Comm: java Tainted: G           OE     4.19.35 #1
    [..]
    RIP: 0010:pmdp_set_access_flags+0x48/0x50
    [..]
    Call Trace:
     vmf_insert_pfn_pmd+0x198/0x350
     dax_iomap_fault+0xe82/0x1190
     ext4_dax_huge_fault+0x103/0x1f0
     ? __switch_to_asm+0x40/0x70
     __handle_mm_fault+0x3f6/0x1370
     ? __switch_to_asm+0x34/0x70
     ? __switch_to_asm+0x40/0x70
     handle_mm_fault+0xda/0x200
     __do_page_fault+0x249/0x4f0
     do_page_fault+0x32/0x110
     ? page_fault+0x8/0x30
     page_fault+0x1e/0x30

Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: c6f3c5ee40 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Piotr Balcer <piotr.balcer@intel.com>
Tested-by: Yan Ma <yan.ma@intel.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Chandan Rajendra <chandan@linux.ibm.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:44 -07:00
Linus Torvalds
7e9890a350 overlayfs update for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXNpu6gAKCRDh3BK/laaZ
 PNYnAQCLMJBZp9AVKU+5onOGKLmgUfnbKZhWJYICW6DVKobo6AEA4aXBIk5TIDiu
 +a3Ny0nAutdpHcRkbi8jJty91BeJgQg=
 =iLSD
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:
 "Just bug fixes in this small update"

* tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: relax WARN_ON() for overlapping layers use case
  ovl: check the capability before cred overridden
  ovl: do not generate duplicate fsnotify events for "fake" path
  ovl: support stacked SEEK_HOLE/SEEK_DATA
  ovl: fix missing upper fs freeze protection on copy up for ioctl
2019-05-14 09:02:14 -07:00
Linus Torvalds
4856118f49 fuse update for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXNpuRwAKCRDh3BK/laaZ
 PMq/AP9kLvB97JU2GbzIJq6wOjDV8whPE/a2Knx0fajvW3AEOAD+NQwdZLmVNql7
 DkkY8lZ7fVut3TMj8jHhpIbv4P1R1AE=
 =qX6f
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:
 "Add more caching controls for userspace filesystems to use, as well as
  bug fixes and cleanups"

* tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: clean up fuse_alloc_inode
  fuse: Add ioctl flag for x32 compat ioctl
  fuse: Convert fusectl to use the new mount API
  fuse: fix changelog entry for protocol 7.9
  fuse: fix changelog entry for protocol 7.12
  fuse: document fuse_fsync_in.fsync_flags
  fuse: Add FOPEN_STREAM to use stream_open()
  fuse: require /dev/fuse reads to have enough buffer capacity
  fuse: retrieve: cap requested size to negotiated max_write
  fuse: allow filesystems to have precise control over data cache
  fuse: convert printk -> pr_*
  fuse: honor RLIMIT_FSIZE in fuse_file_fallocate
  fuse: fix writepages on 32bit
2019-05-14 08:59:14 -07:00
Linus Torvalds
0d28544117 f2fs-for-5.2-rc1
Another round of various bug fixes came in. Damien improved SMR drive support a
 bit, and Chao replaced BUG_ON() with reporting errors to user since we've not
 hit from users but did hit from crafted images. We've found a disk layout bug
 in large_nat_bits feature which supports very large NAT entries enabled at mkfs.
 If the feature is enabled, it will give a notice to run fsck to correct the
 on-disk layout.
 
 Enhancement:
  - reduce memory consumption for SMR drive
  - better discard handling for multiple partitions
  - tracepoints for f2fs_file_write_iter/f2fs_filemap_fault
  - allow to change CP_CHKSUM_OFFSET
  - detect wrong layout of large_nat_bitmap feature
  - enhance checking valid data indices
 
 Bug fix:
  - Multiple partition support for SMR drive
  - deadlock problem in f2fs_balance_fs_bg
  - add boundary checks to fix abnormal behaviors on fuzzed images
  - inline_xattr space calculations
  - replace f2fs_bug_on with errors
 
 In addition, this series contains various memory boundary check and sanity check
 of on-disk consistency.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlzZ/ukACgkQQBSofoJI
 UNKnkQ//U6QsEgNsC5GeNAXPLxodxC406CSo+aRk//3xUdORzkVNia1ThLeYVNfd
 3LsLaSy1eLZ0GylrR/MpvHt+eSWH5L5Ferj2v4l1zAYLr29FEhl6bmYPyvc3e4pr
 IbHwC6W1g1sV2LxeNp7z0chkT7cOAm633d3OVJzUXD5B1k52lx72QnqoXal0Ae1L
 tt3h/TVBIRJX26nSTiEZTBnIDmpiZYSsJVGgjqLnTCXHWUKPPfuJqALiELluhBNJ
 M7gDE3/JYJyb+1PrdtvsEesPJxSLovGJJJvcJKcuMhWwij0Jsq2BwiP3shcfj8iA
 76PiPlhjS6D5sMo1hJzbKettusfZrxX284UHNacrkgA/TBbHeGGBy3Fbh6B3+/o1
 qvCl0atqp3km6Z8vg5r7nDTOMrg0JSjsHz3WA9ZXZ+cSM6mGxk7vYMd7Sn7h2GqY
 deIfWqlTdB8hOFQJWDswXI2ILyJlvquc4jTKIUmHp4ZEQXdYSWlBUWm7+1XZvoYn
 rlrAcr/loSQ/gT4U/Z9RQdpMYb2n2n3+YF8QAeTDmVafMeqbrwspCZf6l8Pfoto1
 ZVYO9QUIsvRBDEiDHdLmRwb1ckTTiHiHrO9YwtA5zlYRRyH93MamQTsH7BTGt0OC
 tjCPbpQGlWK6M9p3LNQ4H5lsdLzD7EEP0JXLOQ57rY3aDaZsOsE=
 =HOz+
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "Another round of various bug fixes came in. Damien improved SMR drive
  support a bit, and Chao replaced BUG_ON() with reporting errors to
  user since we've not hit from users but did hit from crafted images.
  We've found a disk layout bug in large_nat_bits feature which supports
  very large NAT entries enabled at mkfs. If the feature is enabled, it
  will give a notice to run fsck to correct the on-disk layout.

  Enhancements:
   - reduce memory consumption for SMR drive
   - better discard handling for multiple partitions
   - tracepoints for f2fs_file_write_iter/f2fs_filemap_fault
   - allow to change CP_CHKSUM_OFFSET
   - detect wrong layout of large_nat_bitmap feature
   - enhance checking valid data indices

  Bug fixes:
   - Multiple partition support for SMR drive
   - deadlock problem in f2fs_balance_fs_bg
   - add boundary checks to fix abnormal behaviors on fuzzed images
   - inline_xattr space calculations
   - replace f2fs_bug_on with errors

  In addition, this series contains various memory boundary check and
  sanity check of on-disk consistency"

* tag 'f2fs-for-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
  f2fs: fix to avoid accessing xattr across the boundary
  f2fs: fix to avoid potential race on sbi->unusable_block_count access/update
  f2fs: add tracepoint for f2fs_filemap_fault()
  f2fs: introduce DATA_GENERIC_ENHANCE
  f2fs: fix to handle error in f2fs_disable_checkpoint()
  f2fs: remove redundant check in f2fs_file_write_iter()
  f2fs: fix to be aware of readonly device in write_checkpoint()
  f2fs: fix to skip recovery on readonly device
  f2fs: fix to consider multiple device for readonly check
  f2fs: relocate chksum_offset for large_nat_bitmap feature
  f2fs: allow unfixed f2fs_checkpoint.checksum_offset
  f2fs: Replace spaces with tab
  f2fs: insert space before the open parenthesis '('
  f2fs: allow address pointer number of dnode aligning to specified size
  f2fs: introduce f2fs_read_single_page() for cleanup
  f2fs: mark is_extension_exist() inline
  f2fs: fix to set FI_UPDATE_WRITE correctly
  f2fs: fix to avoid panic in f2fs_inplace_write_data()
  f2fs: fix to do sanity check on valid block count of segment
  f2fs: fix to do sanity check on valid node/block count
  ...
2019-05-14 08:55:43 -07:00
Tobin C. Harding
fbcde197e1 gfs2: Fix error path kobject memory leak
If a call to kobject_init_and_add() fails we must call kobject_put()
otherwise we leak memory.

Function gfs2_sys_fs_add always calls kobject_init_and_add() which
always calls kobject_init().

It is safe to leave object destruction up to the kobject release
function and never free it manually.

Remove call to kfree() and always call kobject_put() in the error path.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-13 15:43:01 -07:00
Linus Torvalds
d4c608115c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlzZmHgACgkQnJ2qBz9k
 QNleKQf/VglDfB2VvuYnMbZp4YBPZgNMe0KlOTYE60RBg4DGZj04ov0bxSRZccPP
 KvHp1ZyIM4Bp67lKFZoI6bYd5diq7gVjHIdHrFc0LfDmpyaodqxe3bOMWT/TH0YU
 0jjr7MU1wmVRY8UYrIqPIlL6Dl0hlxGf5cYPzoRQB4hzNWEWr9ZEdC1DbKupWTk/
 1GsYtCHZzpZ20YfzDU3jcYaHRVZDwZXb65Wx3OfUVof8q9bkpHd5lhl79jT4yVPS
 yYfqs1CW6ra2zsYDtBdmh46BZ3d36ACfcjmgf2/7GMdQP8Ocv0c5lnSSkrJeC5XO
 GttTFIKU7JWFAqiWmPAoGIchUXNkRA==
 =eSpi
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "Two fsnotify fixes"

* tag 'fsnotify_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix unlink performance regression
  fsnotify: Clarify connector assignment in fsnotify_add_mark_list()
2019-05-13 15:08:16 -07:00
Linus Torvalds
29c079caf5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlzZljwACgkQnJ2qBz9k
 QNmZ3wf/fMe6rMOFCHE7RT/Nuq+H9G7EVjk+Cch8+EFXPRxDLgQUE03LZ5VzpZw0
 U4SsGFqLO/pGwtGPDRe789hQNqjmCjdEA86wJrUy6UCobeUkHrXU1XL6XnmvKKGP
 UvAFBIz2F0GWCcm4yWlbW25yLf/aFI8t/50/sahfgj+6v9Tezfs3FGVJEta7D/KH
 PNLDx2zMS+aiQJfjo81bEqS/87b4so8ioudFlyMOlwLQslvtR7SzvmvXHxG7VpGY
 pI6dTnXqOjykWWAYDc5J2/D9drbA1QxcanuoRW0Eg9TYPCc8MQVakbQ203GyAPxP
 rEHq6aKi0Fp1vyzKh/Zoa5O7TsgReg==
 =cOTS
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc filesystem updates from Jan Kara:
 "A couple of small bugfixes and cleanups for quota, udf, ext2, and
  reiserfs"

* tag 'fs_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: check time limit when back out space/inode change
  fs/quota: erase unused but set variable warning
  quota: fix wrong indentation
  udf: fix an uninitialized read bug and remove dead code
  fs/reiserfs/journal.c: Make remove_journal_hash static
  quota: remove trailing whitespaces
  quota: code cleanup for __dquot_alloc_space()
  ext2: Adjust the comment of function ext2_alloc_branch
  udf: Explain handling of load_nls() failure
2019-05-13 14:59:55 -07:00
Linus Torvalds
d7a02fa0a8 This pull request contains the following changes for UBI/UBIFS
- fscrypt framework usage updates
 - One huge fix for xattr unlink
 - Cleanup of fscrypt ifdefs
 - Fix for our new UBIFS auth feature
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlzYkIgWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wTRrD/99iBd4f8F0jF1wmB8/9kDAnz5s
 KaK+VtC0RVRijRijYzo+/2kDXpXEbmPycg6AVl5EfKxXCVFw1K7pQvuBX43qyv4o
 BINRv1av8FEBA9eTjvBgZJUrjB1AuvV37716/OeM2bnvuCsp1escnvTEh6S3VFYw
 oWDBgZJd+DE10CYtZjuLoyDPcYdNrzebbmu3Xbfl2XsPwZFUJIrymMd6NE8Xdk3I
 EQbZ3guEM5Djui+nrko3iKzfoZ4eK7WguO3DOEjUHpwea4ZfnZtnlH345aYOAqRE
 N5qrDCzXOsWs6Zs+clODMQgg+aTN3kGBNV534culcpMAbUp7WXynUQ1DDqtOJNJO
 pGFjhAfGi4E6YgB3UwqxMbXxI4Tg/X2ckc77hWZlC7h/1Y/i89nacT6Ij5rPNOn1
 mby1mFxWHI04uSEICWyocFK4m/J2b17Tmte2Mc5ZOigQqREUB7J8wiT4NWm6GhV1
 nTb5DA8MepC3zopbsL/iAiKPhSkH1h6AkabBw1ADTksacgNUfhjzALkxqa64tIqv
 C43QG3n/HqsNZJ4aLdizLLb8KIt4pWsIaqHOeDGSfr3I1GEBrpfKiR72P/h3fSF9
 9GIFJU5HiV+3zeAC2024muaV7KjcimZ6t/hPFTCFH9pMGNk2Mtn/gZFfmqnjLKbj
 TDxUTrZF9Lujonrbwg==
 =ymCJ
 -----END PGP SIGNATURE-----

Merge tag 'upstream-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - fscrypt framework usage updates

 - One huge fix for xattr unlink

 - Cleanup of fscrypt ifdefs

 - Fix for our new UBIFS auth feature

* tag 'upstream-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubi: wl: Fix uninitialized variable
  ubifs: Drop unnecessary setting of zbr->znode
  ubifs: Remove ifdefs around CONFIG_UBIFS_ATIME_SUPPORT
  ubifs: Remove #ifdef around CONFIG_FS_ENCRYPTION
  ubifs: Limit number of xattrs per inode
  ubifs: orphan: Handle xattrs like files
  ubifs: journal: Handle xattrs like files
  ubifs: find.c: replace swap function with built-in one
  ubifs: Do not skip hash checking in data nodes
  ubifs: work around high stack usage with clang
  ubifs: remove unused function __ubifs_shash_final
  ubifs: remove unnecessary #ifdef around fscrypt_ioctl_get_policy()
  ubifs: remove unnecessary calls to set up directory key
2019-05-12 18:16:31 -04:00
Linus Torvalds
983dfa4b6e This pull request contains the following changes for UML:
- Kconfig cleanups
 - Fix cpu_all_mask() usage
 - Various bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlzYi30WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wdDcD/wLx0xljjSb+j08VVSvVWGah1Vl
 DMVyLp1Eik8KRnc6vR+IfC6qDE2+QmJvcLLx4IQ8wpgce+mvhLSy0+8SNsU9tz7t
 7ZYVR++L3If3dx72J1aJquQt4PNLQn7QAdPWOA/FiYy4mqjxZUg4HVwf/Oge/2Un
 jfom649xl1gdcYlXTCOadb4Xmqo1BSEW+Ms1zqrQlBpU6ePMvojPkjBMdaCbCjMg
 bLt4XjtVbgBH3FnH0ZvuDzrMW229LiLot4KF0iUW36/gV/ZRATbinst5AQ5mUsMP
 GgrqbeU+wDdzt73p/l1NG7u3DZHOhoAW1ZWTqwBMKiazQiJPa90V9TIOwbnSl7zc
 hBEKKkU/u6p5E5TADcTty9ZJfCM+3Zatqt004WSbi+ug363G08XrTb3wWz6AruQ/
 9shTUmzwYsK1Bzllf2T2WShBrN+vMdmpzf4+v66N1KhcPrb7Eh81N/VhQG+rvfSb
 Ju/lDhu6OxlHr9OlGinI0SCLgjpk3qWcNd1noFdQsTewIopQsOL6H4R7711md3ow
 PWl7HAspvCRD3ub12y0wS3bb/4AUyoBrMDT/VBfk2vH0BbCzlR/ckaKE+lk2Y2Mr
 BpURt1zcqnpqi5LqRC//dhCFPyzpXd+yYVy1P6bN8q5lvfuIoaRdl2YeWjMfoo0v
 r+loEdGNa57Qj67ncg==
 =HB9o
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

 - Kconfig cleanups

 - Fix cpu_all_mask() usage

 - Various bug fixes

* tag 'for-linus-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: irq: don't set the chip for all irqs
  um: define set_pte_at() as a static inline function, not a macro
  um: remove uses of variable length arrays
  um: remove unused variable
  uml: fix a boot splat wrt use of cpu_all_mask
  um: Do not unlock mutex that is not hold.
  hostfs: fix mismatch between link_file definition and declaration
  arch: um: drivers: Kconfig: pedantic formatting
  arch: um: Kconfig: pedantic indention cleanups
  um: Revert to using stack for pt_regs in signal handling
2019-05-12 17:52:13 -04:00
Linus Torvalds
8ea5b2abd0 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount fix from Al Viro:
 "Fix for umount -l/mount --move race caught by syzbot yesterday..."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_move_mount(): fix an unsafe use of is_anon_ns()
2019-05-09 19:35:41 -07:00
Linus Torvalds
06cbd26d31 NFS client updates for Linux 5.2
Stable bugfixes:
 - Fall back to MDS if no deviceid is found rather than aborting   # v4.11+
 - NFS4: Fix v4.0 client state corruption when mount
 
 Features:
 - Much improved handling of soft mounts with NFS v4.0
   - Reduce risk of false positive timeouts
   - Faster failover of reads and writes after a timeout
   - Added a "softerr" mount option to return ETIMEDOUT instead of
     EIO to the application after a timeout
 - Increase number of xprtrdma backchannel requests
 - Add additional xprtrdma tracepoints
 - Improved send completion batching for xprtrdma
 
 Other bugfixes and cleanups:
 - Return -EINVAL when NFS v4.2 is passed an invalid dedup mode
 - Reduce usage of GFP_ATOMIC pages in SUNRPC
 - Various minor NFS over RDMA cleanups and bugfixes
 - Use the correct container namespace for upcalls
 - Don't share superblocks between user namespaces
 - Various other container fixes
 - Make nfs_match_client() killable to prevent soft lockups
 - Don't mark all open state for recovery when handling recallable state revoked flag
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlzUjdcACgkQ18tUv7Cl
 QOsUiw/+OirzlZI7XeHfpZ/CwS7A+tSk3AAg9PDS1gjbfylER0g++GpA08tXnmDt
 JdUnBKYC5ujLyAqxN1j7QK+EvmXZQro8rucJxhEdPJMIQDC65fQQnmW7efl2bAEv
 CAWNDCf9Xe4g6X8LSR5jrnaMV4kuOQBYX4wqrrmaV8I+g/A/GKXW262KWnAv+w1M
 Y1ZlX+d1Gm8hODXhvqz4lldW6bkyrpWpU9BKUtYSYnSR0x1fam6PLPuCTm74fEDR
 N/Tgy5XvJi4xgti4SOZ/dI2O/Oqu6ut81PEPlhs8sTX04G8bLhr+hl3rSksCZFlu
 Afz9Hcnxg6XYB3Va7j7AO67H5SbyX4Zyj5cRMipXQE7Ebc1iXo5lu3vdhAEOAtNx
 fdNJlqD86MC/XWbtM+DfWlD+KjtpZ+lkxN+xuMgC/kVaPTeFI7nEWM796hJP/4no
 EYtnSLbSpJyH6F7wH9IL5V2EJYFxbzTvnPSTxV+QNZ0HgF17gTY0AGmQBzDE5bF0
 tfQteOG6MYXMHg64pTEzjlowlXOWdnE5TnuaFpt64/yP+hVznZMepBMSkxZO1xYt
 jc1wQlJkv/SyVH7cMGsj5lw3A6zwTrLManDUUmrLjIsVVmh4dk8WKlNtWQmvf1v6
 nFBklUa2GzH8LWKRT2ftNGcUeEiCuw/QF9oE5T/V7/7SQ/wmmvA=
 =skb2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - Fall back to MDS if no deviceid is found rather than aborting   # v4.11+
   - NFS4: Fix v4.0 client state corruption when mount

  Features:
   - Much improved handling of soft mounts with NFS v4.0:
       - Reduce risk of false positive timeouts
       - Faster failover of reads and writes after a timeout
       - Added a "softerr" mount option to return ETIMEDOUT instead of
         EIO to the application after a timeout
   - Increase number of xprtrdma backchannel requests
   - Add additional xprtrdma tracepoints
   - Improved send completion batching for xprtrdma

  Other bugfixes and cleanups:
   - Return -EINVAL when NFS v4.2 is passed an invalid dedup mode
   - Reduce usage of GFP_ATOMIC pages in SUNRPC
   - Various minor NFS over RDMA cleanups and bugfixes
   - Use the correct container namespace for upcalls
   - Don't share superblocks between user namespaces
   - Various other container fixes
   - Make nfs_match_client() killable to prevent soft lockups
   - Don't mark all open state for recovery when handling recallable
     state revoked flag"

* tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (69 commits)
  SUNRPC: Rebalance a kref in auth_gss.c
  NFS: Fix a double unlock from nfs_match,get_client
  nfs: pass the correct prototype to read_cache_page
  NFSv4: don't mark all open state for recovery when handling recallable state revoked flag
  SUNRPC: Fix an error code in gss_alloc_msg()
  SUNRPC: task should be exit if encode return EKEYEXPIRED more times
  NFS4: Fix v4.0 client state corruption when mount
  PNFS fallback to MDS if no deviceid found
  NFS: make nfs_match_client killable
  lockd: Store the lockd client credential in struct nlm_host
  NFS: When mounting, don't share filesystems between different user namespaces
  NFS: Convert NFSv2 to use the container user namespace
  NFSv4: Convert the NFS client idmapper to use the container user namespace
  NFS: Convert NFSv3 to use the container user namespace
  SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall
  SUNRPC: Use the client user namespace when encoding creds
  NFS: Store the credential of the mount process in the nfs_server
  SUNRPC: Cache cred of process creating the rpc_client
  xprtrdma: Remove stale comment
  xprtrdma: Update comments that reference ib_drain_qp
  ...
2019-05-09 14:33:15 -07:00
Benjamin Coddington
c260121a97 NFS: Fix a double unlock from nfs_match,get_client
Now that nfs_match_client drops the nfs_client_lock, we should be
careful
to always return it in the same condition: locked.

Fixes: 950a578c61 ("NFS: make nfs_match_client killable")
Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:57 -04:00
Christoph Hellwig
a46126ccc7 nfs: pass the correct prototype to read_cache_page
Fix the callbacks NFS passes to read_cache_page to actually have the
proper type expected.  Casting around function pointers can easily
hide typing bugs, and defeats control flow protection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:57 -04:00
Scott Mayhew
8ca017c8ce NFSv4: don't mark all open state for recovery when handling recallable state revoked flag
Only delegations and layouts can be recalled, so it shouldn't be
necessary to recover all opens when handling the status bit
SEQ4_STATUS_RECALLABLE_STATE_REVOKED.  We'll still wind up calling
nfs41_open_expired() when a TEST_STATEID returns NFS4ERR_DELEG_REVOKED.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:57 -04:00
ZhangXiaoxu
f02f3755db NFS4: Fix v4.0 client state corruption when mount
stat command with soft mount never return after server is stopped.

When alloc a new client, the state of the client will be set to
NFS4CLNT_LEASE_EXPIRED.

When the server is stopped, the state manager will work, and accord
the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it
will drain the slot table and lead other task to wait queue, until
the client recovered. Then the stat command is hung.

When discover server trunking, the client will renew the lease,
but check the client state, it lead the client state corruption.

So, we need to call state manager to recover it when detect server
ip trunking.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:05 -04:00
Olga Kornievskaia
b1029c9bc0 PNFS fallback to MDS if no deviceid found
If we fail to find a good deviceid while trying to pnfs instead of
propogating an error back fallback to doing IO to the MDS. Currently,
code with fals the IO with EINVAL.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 8d40b0f148 ("NFS filelayout:call GETDEVICEINFO after pnfs_layout_process completes"
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:24:56 -04:00
Randall Huang
2777e65437 f2fs: fix to avoid accessing xattr across the boundary
When we traverse xattr entries via __find_xattr(),
if the raw filesystem content is faked or any hardware failure occurs,
out-of-bound error can be detected by KASAN.
Fix the issue by introducing boundary check.

[   38.402878] c7   1827 BUG: KASAN: slab-out-of-bounds in f2fs_getxattr+0x518/0x68c
[   38.402891] c7   1827 Read of size 4 at addr ffffffc0b6fb35dc by task
[   38.402935] c7   1827 Call trace:
[   38.402952] c7   1827 [<ffffff900809003c>] dump_backtrace+0x0/0x6bc
[   38.402966] c7   1827 [<ffffff9008090030>] show_stack+0x20/0x2c
[   38.402981] c7   1827 [<ffffff900871ab10>] dump_stack+0xfc/0x140
[   38.402995] c7   1827 [<ffffff9008325c40>] print_address_description+0x80/0x2d8
[   38.403009] c7   1827 [<ffffff900832629c>] kasan_report_error+0x198/0x1fc
[   38.403022] c7   1827 [<ffffff9008326104>] kasan_report_error+0x0/0x1fc
[   38.403037] c7   1827 [<ffffff9008325000>] __asan_load4+0x1b0/0x1b8
[   38.403051] c7   1827 [<ffffff90085fcc44>] f2fs_getxattr+0x518/0x68c
[   38.403066] c7   1827 [<ffffff90085fc508>] f2fs_xattr_generic_get+0xb0/0xd0
[   38.403080] c7   1827 [<ffffff9008395708>] __vfs_getxattr+0x1f4/0x1fc
[   38.403096] c7   1827 [<ffffff9008621bd0>] inode_doinit_with_dentry+0x360/0x938
[   38.403109] c7   1827 [<ffffff900862d6cc>] selinux_d_instantiate+0x2c/0x38
[   38.403123] c7   1827 [<ffffff900861b018>] security_d_instantiate+0x68/0x98
[   38.403136] c7   1827 [<ffffff9008377db8>] d_splice_alias+0x58/0x348
[   38.403149] c7   1827 [<ffffff900858d16c>] f2fs_lookup+0x608/0x774
[   38.403163] c7   1827 [<ffffff900835eacc>] lookup_slow+0x1e0/0x2cc
[   38.403177] c7   1827 [<ffffff9008367fe0>] walk_component+0x160/0x520
[   38.403190] c7   1827 [<ffffff9008369ef4>] path_lookupat+0x110/0x2b4
[   38.403203] c7   1827 [<ffffff900835dd38>] filename_lookup+0x1d8/0x3a8
[   38.403216] c7   1827 [<ffffff900835eeb0>] user_path_at_empty+0x54/0x68
[   38.403229] c7   1827 [<ffffff9008395f44>] SyS_getxattr+0xb4/0x18c
[   38.403241] c7   1827 [<ffffff9008084200>] el0_svc_naked+0x34/0x38

Signed-off-by: Randall Huang <huangrandall@google.com>
[Jaegeuk Kim: Fix wrong ending boundary]
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-05-09 09:43:29 -07:00
Linus Torvalds
8823880561 Orangefs: This pull request includes one fix and our "Orangefs through
the pagecache" patch series which greatly improves our small IO
 performance and helps us pass more xfstests than before.
 
 fix: orangefs: truncate before updating size
 
 Pagecache series: all the rest
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc1CucAAoJEM9EDqnrzg2+2h4P/i2ktG4Zi3VDwCr1xea0bWSf
 kidYxzVPmGAmfC+UNMn/+X4v/R6rtHcpW/g7/0WLqhLUhRwjB1gzT6U9mHfu8BQV
 1I7tg6lQqcTi4gNkBMWI4MHKRyoQ5bAgY7jFxLjqx1zj5+qxrtPn4GUCUbDBDPpP
 Ip9RD/izEKeG77gdSUkCD72ojHl9lqmT1Qd3k3asmozzdWzd8mdXy8EQplECo7tZ
 1tLzV+BSER2BVQfmws2CCG00DLFzQRJkDYxm7I5l1xoQ/Ft8n6k/tXZ1QzkPREg2
 FhaoY5eCA/2MkB0VhUKj1Y+gXNZcbys11oxcqyZZ6p2kOu+1bY+evkzwDUzsIAg+
 nl1d3LoeEVKoxpa4O+vCHLF++HX8TuVHSaxvNaZPMm86gxowCh0aberyWcscGbQp
 m2M28WdjlzYC7uPSakoKVvF+ZBNKQP5sn6jbjYSOHAB9nFz6x5ILH3SMK8MXoZZe
 2Ejd4NBmtqekdLp2MNlAWHF1O9sTagH+OqUvbNRNjrQXM3hNSgeguGx1b0E+KlT2
 lwKORIVwZNZ/xuizdwM01kWRQObAmGrptlVbR4+HKqo99g9gGv6sdw//cHzo9/2S
 GV34cpS5VRSwHVwuzADD1A/qQ+k5xomihrpZwcqh+YOwQV16JnTHiSeTYZKVIEW8
 cBiZ2k5dbKm1QcLc9tJ3
 =uBS+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "This includes one fix and our "Orangefs through the pagecache" patch
  series which greatly improves our small IO performance and helps us
  pass more xfstests than before.

  Fix:
   - orangefs: truncate before updating size

  Pagecache series:
   - all the rest"

* tag 'for-linus-5.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (23 commits)
  orangefs: truncate before updating size
  orangefs: copy Orangefs-sized blocks into the pagecache if possible.
  orangefs: pass slot index back to readpage.
  orangefs: remember count when reading.
  orangefs: add orangefs_revalidate_mapping
  orangefs: implement writepages
  orangefs: write range tracking
  orangefs: avoid fsync service operation on flush
  orangefs: skip inode writeout if nothing to write
  orangefs: move do_readv_writev to direct_IO
  orangefs: do not return successful read when the client-core disappeared
  orangefs: implement writepage
  orangefs: migrate to generic_file_read_iter
  orangefs: service ops done for writeback are not killable
  orangefs: remove orangefs_readpages
  orangefs: reorganize setattr functions to track attribute changes
  orangefs: let setattr write to cached inode
  orangefs: set up and use backing_dev_info
  orangefs: hold i_lock during inode_getattr
  orangefs: update attributes rather than relying on server
  ...
2019-05-09 09:37:25 -07:00
Amir Goldstein
4d8e7055a4 fsnotify: fix unlink performance regression
__fsnotify_parent() has an optimization in place to avoid unneeded
take_dentry_name_snapshot().  When fsnotify_nameremove() was changed
not to call __fsnotify_parent(), we left out the optimization.
Kernel test robot reported a 5% performance regression in concurrent
unlink() workload.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Link: https://lore.kernel.org/lkml/20190505062153.GG29809@shao2-debian/
Link: https://lore.kernel.org/linux-fsdevel/20190104090357.GD22409@quack2.suse.cz/
Fixes: 5f02a87763 ("fsnotify: annotate directory entry modification events")
CC: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-05-09 12:44:00 +02:00