linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-04 02:06:43 +07:00

Nitin Gupta facdaa917c mm: proactive compaction

For some applications, we need to allocate almost all memory as hugepages. However, on a running system, higher-order allocations can fail if the memory is fragmented. Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define a new sysctl called 'vm.compaction_proactiveness' which dictates bounds for external fragmentation which kcompactd tries to maintain. The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too many tunables (per-order extfrag{low, high}), but this one reduces them to just one sysctl. Also, the new tunable is an opaque value instead of asking for specific bounds of "external fragmentation", which would have been difficult to estimate. The internal interpretation of this opaque value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high] "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). The score for a node is defined as weighted mean of per-zone external fragmentation. A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads, which are woken up every 500 milliseconds to check the same. If a node's score exceeds its high threshold (as derived from user-provided proactiveness value), proactive compaction is started until its score reaches its low threshold value.
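The translation from the tunable to thresholds, and the per-node score, can be sketched as follows. This is an illustrative Python model of the description above, not the kernel code. The function names, the (extfrag, present_pages) pair layout, and the cap of the high threshold at 100 are assumptions made for the example. "high=low+10%" is read here as low plus 10 points, which matches the low=80, high=90 values the message gives for the default.

```python
def thresholds(proactiveness):
    """Map the tunable (0..100) to (low, high) fragmentation-score thresholds."""
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    high = min(low + 10, 100)  # assumption: a score cannot exceed 100
    return low, high


def node_score(zones):
    """Weighted mean of per-zone external fragmentation.

    zones is a list of (extfrag_percent, present_pages) pairs; this layout
    is hypothetical and only serves the example.
    """
    total_pages = sum(pages for _, pages in zones)
    if total_pages == 0:
        return 0.0
    return sum(extfrag * pages for extfrag, pages in zones) / total_pages


def should_start_compaction(score, proactiveness):
    """Proactive compaction starts once the score exceeds the high threshold."""
    _, high = thresholds(proactiveness)
    return score > high


if __name__ == "__main__":
    zones = [(95, 1000), (85, 3000)]  # made-up zones for illustration
    print(thresholds(20))
    print(node_score(zones))
    print(should_start_compaction(node_score(zones), 20))
```

In this model a larger zone pulls the node score toward its own fragmentation value, so a small, badly fragmented zone alone is unlikely to trigger compaction.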
By default, proactiveness is set to 20, which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2]. See also the LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace program that allocates all memory and then for each 2M aligned section, frees 3/4 of base pages using munmap. The workload is mainly anonymous userspace pages, which are easy to move around. I intentionally avoided unmovable pages in this test to see how much latency we incur when hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

    percentile  latency
    ----------  -------
             5     7894
            10     9496
            25    12561
            30    15295
            40    18244
            50    21229
            60    27556
            75    30147
            80    31047
            90    32859
            95    33799

  Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile  latency
    ----------  -------
             5        2
            10        2
            25        3
            30        3
            40        3
            50        4
            60        4
            75        4
            80        4
            90        5
            95      429

  Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1). Then, we start a Java process with a heap size set to 700G and request the heap to be allocated with THP hugepages. We also set THP to madvise to allow hugepage backing of this heap.

    /usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these workloads. The situation of one-time compaction, sufficient to supply hugepages for following allocation stream, can probably happen for more extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As the benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries).
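
The "~30 seconds between retries" figure follows from multiplying the check interval by the number of checks skipped while compaction is deferred. A small sketch of that arithmetic, assuming the 500 ms wake-up interval stated earlier in the message and a maximum deferral of 64 checks (1 << 6). The shift of 6 is an assumption made for this example, since the message only says "maximum number of times". The names below are illustrative.

```python
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500  # wake-up interval given in the commit message
MAX_DEFER_SHIFT = 6                   # assumption: not stated in the commit message


def retry_interval_seconds(interval_msec=HPAGE_FRAG_CHECK_INTERVAL_MSEC,
                           defer_shift=MAX_DEFER_SHIFT):
    """Seconds between proactive compaction retries while fully deferred."""
    skipped_checks = 1 << defer_shift
    return skipped_checks * interval_msec / 1000


if __name__ == "__main__":
    print(retry_interval_seconds())
```

Under these assumptions the interval comes to 32 seconds, which agrees with the rough figure quoted in the message.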

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2020-08-12 10:57:56 -07:00

..
kasan	Kbuild updates for v5.9	2020-08-09 14:10:26 -07:00
backing-dev.c	writeback: remove struct bdi_writeback_congested	2020-07-08 17:05:53 -06:00
balloon_compaction.c	mm/balloon_compaction: suppress allocation warnings	2019-09-04 07:42:01 -04:00
cleancache.c	Driver Core and debugfs changes for 5.3-rc1	2019-07-12 12:24:03 -07:00
cma_debug.c	debugfs: make sure we can remove u32_array files cleanly	2020-07-10 13:54:00 -07:00
cma.c	mm/cma.c: use exact_nid true to fix possible per-numa cma leak	2020-07-03 16:15:25 -07:00
cma.h	debugfs: make sure we can remove u32_array files cleanly	2020-07-10 13:54:00 -07:00
compaction.c	mm: proactive compaction	2020-08-12 10:57:56 -07:00
debug_page_ref.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
debug_vm_pgtable.c	Documentation/mm: add descriptions for arch page table helpers	2020-08-07 11:33:23 -07:00
debug.c	mm, dump_page: do not crash with bad compound_mapcount()	2020-08-07 11:33:23 -07:00
dmapool.c	mm/dmapool.c: micro-optimisation remove unnecessary branch	2020-04-07 10:43:42 -07:00
early_ioremap.c	mm/early_ioremap.c: use %pa to print resource_size_t variables	2020-01-31 10:30:38 -08:00
fadvise.c	mm: return void from various readahead functions	2020-06-02 10:59:06 -07:00
failslab.c	mm/failslab.c: by default, do not fail allocations with direct reclaim only	2019-07-12 11:05:43 -07:00
filemap.c	mm: filemap: add missing FGP_ flags in kerneldoc comment for pagecache_get_page	2020-08-07 11:33:23 -07:00
frame_vector.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
frontswap.c	treewide: Remove uninitialized_var() usage	2020-07-16 12:35:15 -07:00
gup_benchmark.c	mm/gup_benchmark: support pin_user_pages() and related calls	2020-04-02 09:35:27 -07:00
gup.c	mm/gup.c: fix the comment of return value for populate_vma_page_range()	2020-08-07 11:33:23 -07:00
highmem.c	mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions	2019-12-10 10:12:55 +01:00
hmm.c	mm/hmm: provide the page mapping order in hmm_range_fault()	2020-07-10 16:24:28 -03:00
huge_memory.c	mm/vmscan: protect the workingset on anonymous LRU	2020-08-12 10:57:55 -07:00
hugetlb_cgroup.c	mm: use fallthrough;	2020-04-07 10:43:41 -07:00
hugetlb.c	mm/hugetlb: add mempolicy check in the reservation routine	2020-08-12 10:57:55 -07:00
hwpoison-inject.c	mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops	2019-12-01 12:59:09 -08:00
init-mm.c	mmap locking API: add MMAP_LOCK_INITIALIZER	2020-06-09 09:39:14 -07:00
internal.h	mm: proactive compaction	2020-08-12 10:57:56 -07:00
interval_tree.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248	2019-06-19 17:09:08 +02:00
ioremap.c	mm: move p?d_alloc_track to separate header file	2020-08-07 11:33:26 -07:00
Kconfig	mm/sparse: cleanup the code surrounding memory_present()	2020-08-07 11:33:27 -07:00
Kconfig.debug	treewide: replace '---help---' in Kconfig files with 'help'	2020-06-14 01:57:21 +09:00
khugepaged.c	mm/vmscan: protect the workingset on anonymous LRU	2020-08-12 10:57:55 -07:00
kmemleak-test.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333	2019-06-05 17:37:06 +02:00
kmemleak.c	mm/kmemleak.c: use address-of operator on section symbols	2020-04-02 09:35:26 -07:00
ksm.c	powerpc updates for 5.9	2020-08-07 10:33:50 -07:00
list_lru.c	mm/list_lru.c: Rename kvfree_rcu() to local variant	2020-06-29 11:59:25 -07:00
maccess.c	maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault	2020-06-17 10:57:41 -07:00
madvise.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
Makefile	mm: move lib/ioremap.c to mm/	2020-08-07 11:33:26 -07:00
mapping_dirty_helpers.c	mm/mapping_dirty_helpers: update huge page-table entry callbacks	2020-04-02 09:35:29 -07:00
memblock.c	mm/memblock: expose only miminal interface to add/walk physmem	2020-07-10 15:08:09 +02:00
memcontrol.c	mm/workingset: prepare the workingset detection infrastructure for anon LRU	2020-08-12 10:57:55 -07:00
memfd.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
memory_hotplug.c	mm/memory_hotplug: document why shuffle_zone() is relevant	2020-08-07 11:33:29 -07:00
memory-failure.c	mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread	2020-06-11 18:17:47 -07:00
memory.c	mm/swap: implement workingset detection for anonymous LRU	2020-08-12 10:57:56 -07:00
mempolicy.c	mm/hugetlb: add mempolicy check in the reservation routine	2020-08-12 10:57:55 -07:00
mempool.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
memremap.c	mm/memremap: set caching mode for PCI P2PDMA memory to WC	2020-04-10 15:36:21 -07:00
memtest.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
migrate.c	mm/vmscan: protect the workingset on anonymous LRU	2020-08-12 10:57:55 -07:00
mincore.c	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites	2020-06-09 09:39:14 -07:00
mlock.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
mm_init.c	mm: adjust vm_committed_as_batch according to vm overcommit policy	2020-08-07 11:33:26 -07:00
mmap.c	mm: remove unnecessary wrapper function do_mmap_pgoff()	2020-08-07 11:33:27 -07:00
mmu_gather.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
mmu_notifier.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
mmzone.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mprotect.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
mremap.c	mm/mremap: start addresses are properly aligned	2020-08-07 11:33:27 -07:00
msync.c	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites	2020-06-09 09:39:14 -07:00
nommu.c	mm: remove unnecessary wrapper function do_mmap_pgoff()	2020-08-07 11:33:27 -07:00
oom_kill.c	mm: memcg: convert vmstat slab counters to bytes	2020-08-07 11:33:24 -07:00
page_alloc.c	mm/page_alloc: fix memalloc_nocma_{save/restore} APIs	2020-08-07 11:33:29 -07:00
page_counter.c	mm/page_counter.c: fix protection usage propagation	2020-08-07 11:33:26 -07:00
page_ext.c	mm/page_ext.c: drop pfn_present() check when onlining	2020-04-07 10:43:40 -07:00
page_idle.c	mm/page_idle.c: skip offline pages	2020-06-08 11:05:55 -07:00
page_io.c	mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io	2020-08-07 11:33:24 -07:00
page_isolation.c	mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE	2020-06-04 15:36:52 -04:00
page_owner.c	mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention	2020-06-03 20:09:45 -07:00
page_poison.c	mm/page_poison.c: fix a typo in a comment	2019-09-24 15:54:08 -07:00
page_reporting.c	mm/page_reporting: add budget limit on how many pages can be reported per pass	2020-04-07 10:43:39 -07:00
page_reporting.h	mm: introduce include/linux/pgtable.h	2020-06-09 09:39:13 -07:00
page_vma_mapped.c	mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page	2020-01-31 10:30:38 -08:00
page-writeback.c	mm: remove vm_total_pages	2020-08-07 11:33:28 -07:00
pagewalk.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
percpu-internal.h	mm: memcg/percpu: account percpu memory to memory cgroups	2020-08-12 10:57:55 -07:00
percpu-km.c	mm: memcg/percpu: account percpu memory to memory cgroups	2020-08-12 10:57:55 -07:00
percpu-stats.c	mm: memcg/percpu: account percpu memory to memory cgroups	2020-08-12 10:57:55 -07:00
percpu-vm.c	mm: memcg/percpu: account percpu memory to memory cgroups	2020-08-12 10:57:55 -07:00
percpu.c	mm: memcg/percpu: per-memcg percpu memory statistics	2020-08-12 10:57:55 -07:00
pgalloc-track.h	mm: move p?d_alloc_track to separate header file	2020-08-07 11:33:26 -07:00
pgtable-generic.c	mm: introduce include/linux/pgtable.h	2020-06-09 09:39:13 -07:00
process_vm_access.c	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites	2020-06-09 09:39:14 -07:00
ptdump.c	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites	2020-06-09 09:39:14 -07:00
readahead.c	mm: use memalloc_nofs_save in readahead path	2020-06-02 10:59:07 -07:00
rmap.c	mmap locking API: convert mmap_sem comments	2020-06-09 09:39:14 -07:00
rodata_test.c	maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault	2020-06-17 10:57:41 -07:00
shmem.c	mm/swapcache: support to handle the shadow entries	2020-08-12 10:57:55 -07:00
shuffle.c	mm/shuffle: remove dynamic reconfiguration	2020-08-07 11:33:29 -07:00
shuffle.h	mm/shuffle: remove dynamic reconfiguration	2020-08-07 11:33:29 -07:00
slab_common.c	mm: memcg/slab: use a single set of kmem_caches for all allocations	2020-08-07 11:33:25 -07:00
slab.c	mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()	2020-08-07 11:33:25 -07:00
slab.h	mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()	2020-08-07 11:33:25 -07:00
slob.c	mm: memcg: convert vmstat slab counters to bytes	2020-08-07 11:33:24 -07:00
slub.c	mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()	2020-08-07 11:33:25 -07:00
sparse-vmemmap.c	mm/sparse: only sub-section aligned range would be populated	2020-08-07 11:33:27 -07:00
sparse.c	mm/sparse: cleanup the code surrounding memory_present()	2020-08-07 11:33:27 -07:00
swap_cgroup.c	mm: memcontrol: make swap tracking an integral part of memory control	2020-06-03 20:09:48 -07:00
swap_slots.c	mm/swap_slots.c: remove redundant check for swap_slot_cache_initialized	2020-08-07 11:33:24 -07:00
swap_state.c	mm/swap: implement workingset detection for anonymous LRU	2020-08-12 10:57:56 -07:00
swap.c	mm/vmscan: protect the workingset on anonymous LRU	2020-08-12 10:57:55 -07:00
swapfile.c	mm/swapcache: support to handle the shadow entries	2020-08-12 10:57:55 -07:00
truncate.c	mm/thp: allow dropping THP from page cache	2019-10-19 06:32:33 -04:00
usercopy.c	usercopy: Avoid HIGHMEM pfn warning	2019-09-17 15:20:17 -07:00
userfaultfd.c	mm/vmscan: protect the workingset on anonymous LRU	2020-08-12 10:57:55 -07:00
util.c	mm: remove unnecessary wrapper function do_mmap_pgoff()	2020-08-07 11:33:27 -07:00
vmacache.c	kernel: better document the use_mm/unuse_mm API contract	2020-06-10 19:14:18 -07:00
vmalloc.c	mm/vmalloc.c: remove BUG() from the find_va_links()	2020-08-07 11:33:28 -07:00
vmpressure.c	mm: vmpressure: use mem_cgroup_is_root API	2020-04-02 09:35:31 -07:00
vmscan.c	mm/vmscan: restore active/inactive ratio for anonymous LRU	2020-08-12 10:57:56 -07:00
vmstat.c	mm: proactive compaction	2020-08-12 10:57:56 -07:00
workingset.c	mm/swap: implement workingset detection for anonymous LRU	2020-08-12 10:57:56 -07:00
z3fold.c	mm/z3fold: silence kmemleak false positives of slots	2020-05-28 11:35:40 -07:00
zbud.c	mm: use false for bool variable	2020-06-04 19:06:24 -07:00
zpool.c	zpool: add malloc_support_movable to zpool_driver	2019-09-24 15:54:12 -07:00
zsmalloc.c	mm: reorder includes after introduction of linux/pgtable.h	2020-06-09 09:39:13 -07:00
zswap.c	mm/zswap: allow setting default status, compressor and allocator in Kconfig	2020-04-07 10:43:41 -07:00