linux_dsm_epyc7002/mm
Joonsoo Kim 474750aba8 vmalloc: use rcu list iterator to reduce vmap_area_lock contention
Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.

  https://lkml.org/lkml/2014/4/10/416

Although I'm not sure that this is right usage or not, there is a
solution reducing vmap_area_lock contention with no side-effect.  That
is just to use rcu list iterator in get_vmalloc_info().

rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe0225 ("mm:
rewrite vmap layer") back in linux-2.6.28

Specifically :
   insertions use list_add_rcu(),
   deletions use list_del_rcu() and kfree_rcu().

Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.

Note that __purge_vmap_area_lazy() already uses this rcu protection.

        rcu_read_lock();
        list_for_each_entry_rcu(va, &vmap_area_list, list) {
                if (va->flags & VM_LAZY_FREE) {
                        if (va->va_start < *start)
                                *start = va->va_start;
                        if (va->va_end > *end)
                                *end = va->va_end;
                        nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                        list_add_tail(&va->purge_list, &valist);
                        va->flags |= VM_LAZY_FREEING;
                        va->flags &= ~VM_LAZY_FREE;
                }
        }
        rcu_read_unlock();

Peter:

: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.

Joonsoo:

: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time.  While we try to get
: certain stats, the other stats can change.  And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.

[edumazet@google.com: add more commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Richard Yao <ryao@gentoo.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:15 -07:00
..
backing-dev.c arch: Mass conversion of smp_mb__*() 2014-04-18 14:20:48 +02:00
balloon_compaction.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
bootmem.c mm/bootmem.c: remove unused local `map' 2013-11-13 12:09:09 +09:00
cleancache.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
compaction.c mm, compaction: properly signal and act upon lock and need_sched() contention 2014-06-04 16:54:11 -07:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c mm/dmapool.c: reuse devres_release() to free resources 2014-06-04 16:54:08 -07:00
early_ioremap.c mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
filemap.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-08-04 16:23:30 -07:00
fremap.c mm: mark remap_file_pages() syscall as deprecated 2014-06-06 16:08:17 -07:00
frontswap.c swap: change swap_list_head to plist, add swap_avail_head 2014-06-04 16:54:07 -07:00
gup.c mm: cleanup __get_user_pages() 2014-06-04 16:54:05 -07:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm: let mm_find_pmd fix buggy race with THP fault 2014-06-23 16:47:44 -07:00
hugetlb_cgroup.c cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() 2014-07-15 11:05:09 -04:00
hugetlb.c kexec: export free_huge_page to VMCOREINFO 2014-07-30 17:16:13 -07:00
hwpoison-inject.c mm/hwpoison: add '#' to hwpoison_inject 2014-01-21 16:19:48 -08:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm, compaction: properly signal and act upon lock and need_sched() contention 2014-06-04 16:54:11 -07:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
iov_iter.c bio_vec-backed iov_iter 2014-05-06 17:39:45 -04:00
Kconfig mm/zsmalloc: make zsmalloc module-buildable 2014-06-04 16:54:14 -07:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm: introduce kmemleak_update_trace() 2014-06-06 16:08:17 -07:00
ksm.c sched: Remove proliferation of wait_on_bit() action functions 2014-07-16 15:10:39 +02:00
list_lru.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: madvise: fix MADV_WILLNEED on shmem swapouts 2014-05-23 09:37:29 -07:00
Makefile mm: move get_user_pages()-related code to separate file 2014-06-04 16:54:04 -07:00
memblock.c mm/memblock.c: call kmemleak directly from memblock_(alloc|free) 2014-06-06 16:08:17 -07:00
memcontrol.c Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-08-04 10:11:28 -07:00
memory_hotplug.c mm/memory_hotplug.c: add __meminit to grow_zone_span/grow_pgdat_span 2014-08-06 18:01:15 -07:00
memory-failure.c hwpoison: call action_result() in failure path of hwpoison_user_mappings() 2014-07-30 17:16:13 -07:00
memory.c mm: debugfs: move rounddown_pow_of_two() out from do_fault path 2014-07-30 17:16:13 -07:00
mempolicy.c Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-07-10 11:38:23 -07:00
mempool.c mm/mempool.c: update the kmemleak stack trace for mempool allocations 2014-06-06 16:08:17 -07:00
migrate.c mm: fix direct reclaim writeback regression 2014-07-26 14:38:50 -07:00
mincore.c mm + fs: prepare for non-page entries in page cache radix trees 2014-04-03 16:21:00 -07:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-04-07 16:35:57 -07:00
mm_init.c mm: bring back /sys/kernel/mm 2014-01-27 21:02:39 -08:00
mmap.c mm: convert some level-less printks to pr_* 2014-06-06 16:08:18 -07:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mm: audit/fix non-modular users of module_init in core code 2014-01-23 16:36:52 -08:00
mmzone.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mprotect.c mm: move mmu notifier call from change_protection to change_pmd_range 2014-04-07 16:35:50 -07:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-05-11 17:55:48 +09:00
msync.c msync: fix incorrect fstart calculation 2014-07-03 09:21:53 -07:00
nobootmem.c mm/memblock.c: call kmemleak directly from memblock_(alloc|free) 2014-06-06 16:08:17 -07:00
nommu.c mm: nommu: per-thread vma cache fix 2014-06-23 16:47:43 -07:00
oom_kill.c mm, oom: base root bonus on current usage 2014-01-30 16:56:56 -08:00
page_alloc.c mm/page_alloc.c: unexport alloc_pages_exact_nid() 2014-08-06 18:01:15 -07:00
page_cgroup.c mm/page_cgroup.c: mark functions as static 2014-04-03 16:21:02 -07:00
page_io.c fix __swap_writepage() compile failure on old gcc versions 2014-06-14 19:30:48 -05:00
page_isolation.c mm: memory-hotplug: enable memory hotplug to handle hugepage 2013-09-11 15:57:48 -07:00
page-writeback.c mm/page-writeback.c: fix divide by zero in bdi_dirty_limits() 2014-07-30 17:16:13 -07:00
pagewalk.c mm/pagewalk.c: fix walk_page_range() access of wrong PTEs 2013-10-30 14:27:03 -07:00
percpu-km.c
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c percpu: Use ALIGN macro instead of hand coding alignment calculation 2014-06-19 11:00:27 -04:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2013-12-18 19:04:51 -08:00
process_vm_access.c start adding the tag to iov_iter 2014-05-06 17:32:49 -04:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm/readahead.c: remove unused file_ra_state from count_history_pages 2014-08-06 18:01:15 -07:00
rmap.c mm/rmap.c: fix pgoff calculation to handle hugepage correctly 2014-07-23 15:10:54 -07:00
shmem.c shmem: fix splicing from a hole while it's punched 2014-07-23 15:10:55 -07:00
slab_common.c mm: move slab related stuff from util.c to slab_common.c 2014-08-06 18:01:15 -07:00
slab.c mm/slab.c: fix comments 2014-08-06 18:01:15 -07:00
slab.h slab: convert last use of __FUNCTION__ to __func__ 2014-08-06 18:01:15 -07:00
slob.c slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
slub.c slab: fix the alias count (via sysfs) of slab cache 2014-08-06 18:01:15 -07:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap_state.c mm: page_alloc: convert hot/cold parameter and immediate callers to bool 2014-06-04 16:54:09 -07:00
swap.c mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
swapfile.c mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of scan_swap_map() 2014-06-04 16:54:12 -07:00
truncate.c mm/fs: fix pessimization in hole-punching pagecache 2014-07-23 15:10:55 -07:00
util.c mm: move slab related stuff from util.c to slab_common.c 2014-08-06 18:01:15 -07:00
vmacache.c mm,vmacache: optimize overflow system-wide flushing 2014-06-04 16:53:57 -07:00
vmalloc.c vmalloc: use rcu list iterator to reduce vmap_area_lock contention 2014-08-06 18:01:15 -07:00
vmpressure.c arm, pm, vmpressure: add missing slab.h includes 2014-02-03 13:24:01 -05:00
vmscan.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
vmstat.c mm: use the light version __mod_zone_page_state in mlocked_vma_newpage() 2014-06-04 16:54:07 -07:00
workingset.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
zbud.c mm/zbud.c: make size unsigned like unique callsite 2014-06-04 16:54:13 -07:00
zsmalloc.c zsmalloc: fixup trivial zs size classes value in comments 2014-06-04 16:54:13 -07:00
zswap.c mm/zswap: NUMA aware allocation for zswap_dstmem 2014-06-04 16:54:14 -07:00