linux_dsm_epyc7002/mm
Kirill A. Shutemov b5a8cad376 thp: close race between split and zap huge pages
Sasha Levin has reported two THP BUGs[1][2].  I believe both of them
have the same root cause.  Let's look to them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".  It's
BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().  From my
testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to race between zap_huge_pmd() and
page_check_address_pmd().  page_check_address_pmd() misses PMD which is
under zap:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding the problem is pmd_present() check in mm_find_pmd()
without taking page table lock.

The bug was introduced by me commit with commit 117b0791ac. Sorry for
that. :(

Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under page table lock.

Note that __page_check_address() does the same for PTE entires
if sync != 0.

I've stress tested split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before it took <20 min to
trigger the first bug and few hours for second one (if we ignore
first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[3.13+]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18 16:40:09 -07:00
..
backing-dev.c bdi: avoid oops on device removal 2014-04-03 16:20:49 -07:00
balloon_compaction.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
bootmem.c mm/bootmem.c: remove unused local `map' 2013-11-13 12:09:09 +09:00
bounce.c block: Convert bio_for_each_segment() to bvec_iter 2013-11-23 22:33:49 -08:00
cleancache.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
compaction.c mm, compaction: determine isolation mode only once 2014-04-07 16:35:55 -07:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
early_ioremap.c mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
filemap.c mm: fix new kernel-doc warning in filemap.c 2014-04-18 16:40:09 -07:00
fremap.c mm: fix bad rss-counter if remap_file_pages raced migration 2014-03-19 16:21:49 -07:00
frontswap.c frontswap: fix incorrect zeroing and allocation size for frontswap_map 2013-06-12 16:29:46 -07:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c thp: close race between split and zap huge pages 2014-04-18 16:40:09 -07:00
hugetlb_cgroup.c cgroup: drop const from @buffer of cftype->write_string() 2014-03-19 10:23:54 -04:00
hugetlb.c mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages() 2014-04-18 16:40:08 -07:00
hwpoison-inject.c mm/hwpoison: add '#' to hwpoison_inject 2014-01-21 16:19:48 -08:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm/readahead.c: inline ra_submit 2014-04-07 16:35:58 -07:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
iov_iter.c take iov_iter stuff to mm/iov_iter.c 2014-04-01 23:19:30 -04:00
Kconfig mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: change some global variables to int 2014-04-03 16:20:50 -07:00
ksm.c mm: close PageTail race 2014-03-04 07:55:47 -08:00
list_lru.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood 2013-09-30 14:31:02 -07:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
memblock.c mm/memblock.c: use PFN_PHYS() 2014-04-07 16:35:58 -07:00
memcontrol.c memcg, slab: do not destroy children caches if parent has aliases 2014-04-07 16:36:13 -07:00
memory_hotplug.c mm/memory_hotplug.c: move register_memory_resource out of the lock_memory_hotplug 2014-01-23 16:36:52 -08:00
memory-failure.c Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-04-03 13:05:42 -07:00
memory.c mm: remove unused arg of set_page_dirty_balance() 2014-04-07 16:35:57 -07:00
mempolicy.c mm, mempolicy: remove per-process flag 2014-04-07 16:35:54 -07:00
mempool.c mempool: add unlikely and likely hints 2014-04-07 16:35:55 -07:00
migrate.c mm: fix swapops.h:131 bug if remap_file_pages raced migration 2014-03-20 22:09:09 -07:00
mincore.c mm + fs: prepare for non-page entries in page cache radix trees 2014-04-03 16:21:00 -07:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-04-07 16:35:57 -07:00
mm_init.c mm: bring back /sys/kernel/mm 2014-01-27 21:02:39 -08:00
mmap.c mm: per-thread vma caching 2014-04-07 16:35:53 -07:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mm: audit/fix non-modular users of module_init in core code 2014-01-23 16:36:52 -08:00
mmzone.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mprotect.c mm: move mmu notifier call from change_protection to change_pmd_range 2014-04-07 16:35:50 -07:00
mremap.c mm: revert mremap pud_free anti-fix 2013-10-16 21:35:53 -07:00
msync.c
nobootmem.c mm/nobootmem.c: mark function as static 2014-04-03 16:21:02 -07:00
nommu.c mm: fix 'ERROR: do not initialise globals to 0 or NULL' and coding style 2014-04-07 16:35:55 -07:00
oom_kill.c mm, oom: base root bonus on current usage 2014-01-30 16:56:56 -08:00
page_alloc.c mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL 2014-04-07 16:35:59 -07:00
page_cgroup.c mm/page_cgroup.c: mark functions as static 2014-04-03 16:21:02 -07:00
page_io.c Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
page_isolation.c mm: memory-hotplug: enable memory hotplug to handle hugepage 2013-09-11 15:57:48 -07:00
page-writeback.c mm: remove unused arg of set_page_dirty_balance() 2014-04-07 16:35:57 -07:00
pagewalk.c mm/pagewalk.c: fix walk_page_range() access of wrong PTEs 2013-10-30 14:27:03 -07:00
percpu-km.c
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c percpu: renew the max_contig if we merge the head and previous block 2014-03-29 09:29:42 -04:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2013-12-18 19:04:51 -08:00
process_vm_access.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm/readahead.c: inline ra_submit 2014-04-07 16:35:58 -07:00
rmap.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-04-07 16:35:57 -07:00
shmem.c mm: Initialize error in shmem_file_aio_read() 2014-04-13 14:10:26 -07:00
slab_common.c memcg, slab: do not destroy children caches if parent has aliases 2014-04-07 16:36:13 -07:00
slab.c Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2014-04-13 13:28:13 -07:00
slab.h memcg, slab: never try to merge memcg caches 2014-04-07 16:36:12 -07:00
slob.c mm: slab/slub: use page->list consistently instead of page->lru 2014-04-11 10:06:06 +03:00
slub.c Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2014-04-13 13:28:13 -07:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap_state.c swap: add a simple detector for inappropriate swapin readahead 2014-02-06 13:48:51 -08:00
swap.c mm: thrash detection-based file cache sizing 2014-04-03 16:21:01 -07:00
swapfile.c mm/swap: fix race on swap_info reuse between swapoff and swapon 2014-02-06 13:48:51 -08:00
truncate.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
util.c Merge git://git.infradead.org/users/eparis/audit 2014-04-12 12:38:53 -07:00
vmacache.c mm: per-thread vma caching 2014-04-07 16:35:53 -07:00
vmalloc.c mm/vmalloc.c: enhance vm_map_ram() comment 2014-04-07 16:35:55 -07:00
vmpressure.c arm, pm, vmpressure: add missing slab.h includes 2014-02-03 13:24:01 -05:00
vmscan.c vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state() 2014-04-18 16:40:07 -07:00
vmstat.c CPU hotplug notifiers registration fixes for 3.15-rc1 2014-04-07 14:55:46 -07:00
workingset.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
zbud.c mm/zbud: fix some trivial typos in comments 2013-09-11 15:57:35 -07:00
zsmalloc.c zsmalloc: Fix CPU hotplug callback registration 2014-03-20 13:43:45 +01:00
zswap.c Merge branch 'akpm' (incoming from Andrew) 2014-04-07 16:38:06 -07:00