linux_dsm_epyc7002/mm
Fengguang Wu 3cf23841b4 mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
Neil found that if too_many_isolated() returns true while performing
direct reclaim we can end up waiting for other threads to complete their
direct reclaim.  If those threads are allowed to enter the FS or IO to
free memory, but this thread is not, then it is possible that those
threads will be waiting on this thread and so we get a circular deadlock.

some task enters direct reclaim with GFP_KERNEL
  => too_many_isolated() false
    => vmscan and run into dirty pages
      => pageout()
        => take some FS lock
          => fs/block code does GFP_NOIO allocation
            => enter direct reclaim again
              => too_many_isolated() true
                => waiting for others to progress, however the other
                   tasks may be circular waiting for the FS lock..

The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
priority than normal ones, by lowering the throttle threshold for the
latter.

Allowing ~1/8 of the LRU to be isolated pages for normal reclaims is large
enough.  For example, for a 1GB LRU list, that's ~128MB of isolated pages,
or 1k blocked tasks (each isolates 32 4KB pages), or 64 blocked tasks per
logical CPU (assuming 16 logical CPUs per NUMA node).  So it's unlikely
that some CPU goes idle waiting (when it could make progress) because of
this limit: there are many more sleeping reclaim tasks than CPUs, so a
task may well be blocked by some low-level queue/lock anyway.

Now !GFP_IOFS reclaims won't wait for GFP_IOFS reclaims to progress.  They
will be blocked only when there are too many concurrent !GFP_IOFS
reclaims, but that's very unlikely because IO-less direct reclaims can
progress much faster and cannot deadlock each other.  The threshold is
raised high enough for them, so that there can be sufficient parallel
progress of !GFP_IOFS reclaims.
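The throttle decision described above can be sketched as a standalone
model (not the kernel source: the GFP bit values and the page-state
counters passed in as plain arguments are stand-ins for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant GFP bits. */
#define __GFP_IO  0x1u
#define __GFP_FS  0x2u
#define GFP_IOFS  (__GFP_IO | __GFP_FS)

/*
 * Model of the throttle check: a direct reclaimer is made to wait when
 * too many pages are already isolated off the inactive LRU list.
 * Normal (GFP_IOFS) reclaims get the stricter ~1/8-of-LRU threshold;
 * !GFP_IOFS reclaims keep the full "isolated > inactive" threshold, so
 * they are never throttled behind GFP_IOFS reclaims.
 */
static bool too_many_isolated(unsigned long inactive,
			      unsigned long isolated,
			      unsigned int gfp_mask)
{
	if ((gfp_mask & GFP_IOFS) == GFP_IOFS)
		inactive >>= 3;	/* throttle normal reclaims at ~1/8 */

	return isolated > inactive;
}
```

With 1024 inactive pages, a GFP_IOFS reclaimer is throttled once more
than 128 pages are isolated, while a GFP_NOIO/GFP_NOFS reclaimer is only
throttled past 1024.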

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: NeilBrown <neilb@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 15:02:15 -08:00
backing-dev.c Revert "bdi: add a user-tunable cpu_list for the bdi flusher threads" 2012-12-17 11:29:09 -08:00
balloon_compaction.c
bootmem.c mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic() 2012-12-12 17:38:35 -08:00
bounce.c
cleancache.c
compaction.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap_xip.c
filemap.c
fremap.c
frontswap.c
highmem.c
huge_memory.c mm: fix kernel BUG at huge_memory.c:1474! 2012-12-16 19:02:38 -08:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hugetlb.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hwpoison-inject.c
init-mm.c
internal.h Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
interval_tree.c
Kconfig memory-hotplug: document and enable CONFIG_MOVABLE_NODE 2012-12-18 15:02:12 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c mm/kmemleak.c: remove obsolete simple_strtoul 2012-12-18 15:02:15 -08:00
ksm.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
maccess.c
madvise.c
Makefile
memblock.c
memcontrol.c slab: propagate tunable values 2012-12-18 15:02:14 -08:00
memory_hotplug.c mm/memory_hotplug.c: improve comments 2012-12-18 15:02:15 -08:00
memory-failure.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
memory.c mm: use kbasename() 2012-12-17 17:15:17 -08:00
mempolicy.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
mempool.c
migrate.c mm,numa: fix update_mmu_cache_pmd call 2012-12-17 19:37:03 -08:00
mincore.c
mlock.c
mm_init.c
mmap.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c mm/mprotect.c: coding-style cleanups 2012-12-18 15:02:15 -08:00
mremap.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
msync.c
nobootmem.c mm: introduce new field "managed_pages" to struct zone 2012-12-12 17:38:34 -08:00
nommu.c
oom_kill.c mm, oom: remove redundant sleep in pagefault oom handler 2012-12-12 17:38:34 -08:00
page_alloc.c mm: allocate kernel pages to the right memcg 2012-12-18 15:02:12 -08:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c
page_isolation.c
page-writeback.c
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c
pgtable-generic.c
process_vm_access.c
quicklist.c
readahead.c
rmap.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
shmem.c lseek: the "whence" argument is called "whence" 2012-12-17 17:15:12 -08:00
slab_common.c slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slab.c memcg: add comments clarifying aspects of cache attribute propagation 2012-12-18 15:02:15 -08:00
slab.h slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slob.c sl[au]b: always get the cache from its page in kmem_cache_free() 2012-12-18 15:02:14 -08:00
slub.c slub: drop mutex before deleting sysfs entry 2012-12-18 15:02:15 -08:00
sparse-vmemmap.c
sparse.c
swap_state.c
swap.c
swapfile.c
truncate.c
util.c
vmalloc.c
vmscan.c mm/vmscan.c: avoid possible deadlock caused by too_many_isolated() 2012-12-18 15:02:15 -08:00
vmstat.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00