linux_dsm_epyc7002/mm
Mike Kravetz 5e41540c8a hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:

	/*
	 * If page is mapped, it was faulted in after being
	 * unmapped in caller.  Unmap (again) now after taking
	 * the fault mutex.  The mutex will prevent faults
	 * until we finish removing the page.
	 *
	 * This race can only happen in the hole punch case.
	 * Getting here in a truncate operation is a bug.
	 */
	if (unlikely(page_mapped(page))) {
		BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:

 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.

 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.

 - Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.

 - Process B then calls mprotect() again to chage protections for the
   mapping back to their original value. pmd remains unshared.

 - Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

	dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.

However, the check for pmd sharing in copy_hugetlb_page_range is:

	/* If the pagetables are shared don't copy or take references */
	if (dst_pte == src_pte)
		continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.

Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
..
kasan mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
backing-dev.c
balloon_compaction.c
cleancache.c
cma_debug.c
cma.c
cma.h
compaction.c psi: pressure stall information for CPU, memory, and IO 2018-10-26 16:26:32 -07:00
debug_page_ref.c
debug.c mm: provide kernel parameter to allow disabling page init poisoning 2018-10-26 16:26:34 -07:00
dmapool.c
early_ioremap.c
fadvise.c
failslab.c
filemap.c vfs: rework data cloning infrastructure 2018-11-02 09:33:08 -07:00
frame_vector.c
frontswap.c
gup_benchmark.c mm/gup_benchmark.c: prevent integer overflow in ioctl 2018-10-31 08:54:12 -07:00
gup.c mm/gup.c: fix __get_user_pages_fast() comment 2018-10-31 08:54:17 -07:00
highmem.c
hmm.c mm/hmm: invalidate device page table at start of invalidation 2018-10-31 08:54:12 -07:00
huge_memory.c mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask 2018-11-03 10:09:37 -07:00
hugetlb_cgroup.c
hugetlb.c hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! 2018-11-18 10:15:09 -08:00
hwpoison-inject.c
init-mm.c
internal.h memblock: rename __free_pages_bootmem to memblock_free_pages 2018-10-31 08:54:16 -07:00
interval_tree.c
Kconfig mm: remove CONFIG_HAVE_MEMBLOCK 2018-10-31 08:54:15 -07:00
Kconfig.debug
khugepaged.c mm: Convert khugepaged_scan_shmem to XArray 2018-10-21 10:46:38 -04:00
kmemleak-test.c
kmemleak.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ksm.c
list_lru.c
maccess.c
madvise.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
Makefile mm: remove nobootmem 2018-10-31 08:54:16 -07:00
memblock.c mm/memblock.c: warn if zero alignment was requested 2018-10-31 08:54:17 -07:00
memcontrol.c mm: handle no memcg case in memcg_kmem_charge() properly 2018-11-03 10:09:37 -07:00
memfd.c memfd: Convert memfd_tag_pins to XArray 2018-10-21 10:46:41 -04:00
memory_hotplug.c memory_hotplug: cond_resched in __remove_pages 2018-11-03 10:09:38 -07:00
memory-failure.c
memory.c mm: Fix warning in insert_pfn() 2018-10-31 08:54:17 -07:00
mempolicy.c mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask 2018-11-03 10:09:37 -07:00
mempool.c
memtest.c
migrate.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
mincore.c xarray: Replace exceptional entries 2018-09-29 22:47:49 -04:00
mlock.c
mm_init.c
mmap.c mm: brk: downgrade mmap_sem to read when shrinking 2018-10-26 16:26:35 -07:00
mmu_context.c
mmu_gather.c
mmu_notifier.c Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks" 2018-10-26 16:25:19 -07:00
mmzone.c
mprotect.c
mremap.c mm: mremap: downgrade mmap_sem to read when shrinking 2018-10-26 16:26:35 -07:00
msync.c
nommu.c mm/gup: cache dev_pagemap while pinning pages 2018-10-26 16:38:15 -07:00
oom_kill.c Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-10-24 11:22:39 +01:00
page_alloc.c memblock: stop using implicit alignment to SMP_CACHE_BYTES 2018-10-31 08:54:16 -07:00
page_counter.c
page_ext.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
page_idle.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
page_io.c for-linus-20181102 2018-11-02 11:25:48 -07:00
page_isolation.c
page_owner.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
page_poison.c virtio, vhost: fixes, tweaks 2018-11-01 14:42:49 -07:00
page_vma_mapped.c mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly 2018-10-31 08:54:11 -07:00
page-writeback.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c Merge branch 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2018-11-01 09:27:57 -07:00
pgtable-generic.c x86/mm: Page size aware flush_tlb_mm_range() 2018-10-09 16:51:11 +02:00
process_vm_access.c
quicklist.c
readahead.c mm: Convert __do_page_cache_readahead to XArray 2018-10-21 10:46:37 -04:00
rmap.c mm: migration: fix migration of huge PMD shared pages 2018-10-05 16:32:04 -07:00
rodata_test.c
shmem.c mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask 2018-11-03 10:09:37 -07:00
slab_common.c mm, slab: shorten kmalloc cache names for large sizes 2018-10-26 16:26:32 -07:00
slab.c mm, slab: combine kmalloc_caches and kmalloc_dma_caches 2018-10-26 16:26:31 -07:00
slab.h
slob.c
slub.c mm, slab: combine kmalloc_caches and kmalloc_dma_caches 2018-10-26 16:26:31 -07:00
sparse-vmemmap.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sparse.c memblock: stop using implicit alignment to SMP_CACHE_BYTES 2018-10-31 08:54:16 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
swap.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
swapfile.c mm: export add_swap_extent() 2018-10-26 16:38:15 -07:00
truncate.c mm: Convert truncate to XArray 2018-10-21 10:46:37 -04:00
usercopy.c
userfaultfd.c
util.c kvfree(): fix misleading comment 2018-10-26 16:26:33 -07:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c vfree: add debug might_sleep() 2018-10-26 16:26:33 -07:00
vmpressure.c
vmscan.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
vmstat.c mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size 2018-10-26 16:26:35 -07:00
workingset.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
z3fold.c z3fold: fix possible reclaim races 2018-11-18 10:15:09 -08:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc.c: fix fall-through annotation 2018-10-26 16:26:35 -07:00
zswap.c