linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-24 14:20:55 +07:00

Chris Down 0e4b01df86 mm, memcg: throttle allocators when failing reclaim over memory.high		2019-09-24 15:54:08 -07:00

We're trying to use memory.high to limit workloads, but have found that containment can frequently fail completely and cause OOM situations outside of the cgroup. This happens especially with swap space -- either when none is configured, or swap is full. These failures often also don't have enough warning to allow one to react, whether for a human or for a daemon monitoring PSI.

Here is output from a simple program showing how long it takes in usec (column 2) to allocate a megabyte of anonymous memory (column 1) when a cgroup is already beyond its memory high setting, and no swap is available:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95  1035
    96  1038
    97  1000
    98  1036
    99  1048
    100 1590
    101 1968
    102 1776
    103 1863
    104 1757
    105 1921
    106 1893
    107 1760
    108 1748
    109 1843
    110 1716
    111 1924
    112 1776
    113 1831
    114 1766
    115 1836
    116 1588
    117 1912
    118 1802
    119 1857
    120 1731
    [...]
    [System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high threshold, as now we spend time scanning before returning to usermode, but it's nowhere near enough to contain growth. It also doesn't get worse the more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of memory.high, and our intentions as cgroup v2 developers. In cgroup-v2.txt, we claim that we will throttle and only under "extreme conditions" will memory.high protection be breached. Likewise, cgroup v2 users generally also expect that memory.high should throttle workloads as they exceed their high threshold.

However, as seen above, this isn't always how it works in practice -- even on banal setups like those with no swap, or where swap has become exhausted, we can end up with memory.high being breached and us having no weapons left in our arsenal to combat runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad the situation is, as "high" events for the memcg may in some cases be benign, and in others be catastrophic. The current status quo is that we fail containment in a way that doesn't provide any advance warning that things are about to go horribly wrong (for example, we are about to invoke the kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to keep memcg size contained at the memory.high setting. It does so by applying an exponential delay curve derived from the memcg's overage compared to memory.high. In the normal case where the memcg is either below or only marginally over its memory.high setting, no throttling will be performed.

This composes well with system health monitoring and remediation, as these allocator delays are factored into PSI's memory pressure calculations. This both creates a mechanism system administrators or applications consuming the PSI interface to trivially see that the memcg in question is struggling and use that to make more reasonable decisions, and permits them enough time to act. Either of these can act with significantly more nuance than that we can provide using the system OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would put the cgroup to sleep if the threshold was violated, but it's also significantly improved as it results in visible memory pressure, and also doesn't schedule indefinitely, which previously made tracing and other introspection difficult (ie. it's clamped at 2HZ per allocation through MEMCG_MAX_HIGH_DELAY_JIFFIES).

Contrast the previous results with a kernel with this patch:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95  1002
    96  1000
    97  1002
    98  1003
    99  1000
    100 1043
    101 84724
    102 330628
    103 610511
    104 1016265
    105 1503969
    106 2391692
    107 2872061
    108 3248003
    109 4791904
    110 5759832
    111 6912509
    112 8127818
    113 9472203
    114 12287622
    115 12480079
    116 14144008
    117 15808029
    118 16384500
    119 16383242
    120 16384979
    [...]

As you can see, in the normal case, memory allocation takes around 1000 usec. However, as we exceed our memory.high, things start to increase exponentially, but fairly leniently at first. Our first megabyte over memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the next is almost an entire second. This gets worse until we reach our eventual 2HZ clamp per batch, resulting in 16 seconds per megabyte. However, this is still making forward progress, so permits tracing or further analysis with programs like GDB.

We use an exponential curve for our delay penalty for a few reasons:

1. We run mem_cgroup_handle_over_high to potentially do reclaim after we've already performed allocations, which means that temporarily going over memory.high by a small amount may be perfectly legitimate, even for compliant workloads. We don't want to unduly penalise such cases.
2. An exponential curve (as opposed to a static or linear delay) allows ramping up memory pressure stats more gradually, which can be useful to work out that you have set memory.high too low, without destroying application performance entirely.

This patch expands on earlier work by Johannes Weiner. Thanks!

[akpm@linux-foundation.org: fix max() warning]
[akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
[akpm@linux-foundation.org: fix it even more]
[chris@chrisdown.name: fix 64-bit divide even more]
Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
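The delay calculation described in the commit message above (a penalty that grows steeply with the overage past memory.high, skipped when the overage is marginal, and clamped at 2HZ per batch) can be illustrated with a small userspace sketch. This is not the code from mm/memcontrol.c. Every name and constant here (`sketch_penalty_jiffies`, the `SKETCH_*` macros, a tick rate of 1000, a batch of 32 pages, the shift values, the HZ/100 cutoff) is an assumption chosen for illustration. The curve is modelled as the square of the fixed-point overage ratio, which is one plausible reading of the "exponential" curve the message describes.

```c
#include <stdint.h>

/* All names and values below are illustrative assumptions, not kernel symbols. */
#define SKETCH_HZ                 1000ULL             /* assumed tick rate */
#define SKETCH_MAX_DELAY_JIFFIES  (2ULL * SKETCH_HZ)  /* the "2HZ" clamp */
#define SKETCH_PRECISION_SHIFT    20                  /* fixed-point fraction bits */
#define SKETCH_SCALING_SHIFT      14                  /* softens the curve */
#define SKETCH_CHARGE_BATCH       32ULL               /* assumed pages per batch */

/*
 * Return a delay in jiffies for a cgroup using `usage` pages against a
 * memory.high of `high` pages, after charging `nr_pages` pages.
 * Assumes page counts stay below 2^44 so the shifts cannot overflow.
 */
uint64_t sketch_penalty_jiffies(uint64_t usage, uint64_t high, uint64_t nr_pages)
{
    uint64_t clamped_high, overage, penalty;

    if (usage <= high)
        return 0;                        /* at or below the limit: no delay */

    clamped_high = high ? high : 1;      /* avoid dividing by zero */

    /*
     * Overage as a fixed-point fraction of memory.high. It is capped at
     * 100% over: at that point the result already reaches the clamp for
     * any nr_pages >= 1, so the cap only guards against overflow.
     */
    if (usage - high >= clamped_high)
        overage = 1ULL << SKETCH_PRECISION_SHIFT;
    else
        overage = ((usage - high) << SKETCH_PRECISION_SHIFT) / clamped_high;

    /* Square the overage so small excursions cost little and large ones a lot. */
    penalty = (overage * overage * SKETCH_HZ) >>
              (SKETCH_PRECISION_SHIFT + SKETCH_SCALING_SHIFT);

    /* Scale by how much was charged relative to a full batch. */
    penalty = penalty * nr_pages / SKETCH_CHARGE_BATCH;

    if (penalty > SKETCH_MAX_DELAY_JIFFIES)
        penalty = SKETCH_MAX_DELAY_JIFFIES;

    /* Marginal overage: skip throttling entirely. */
    if (penalty <= SKETCH_HZ / 100)
        return 0;

    return penalty;
}
```

With these assumed constants and 4 KiB pages, a 100 MiB limit is 25600 pages. Calling `sketch_penalty_jiffies(25856, 25600, 32)` (1% over) returns 0, while `sketch_penalty_jiffies(51200, 25600, 32)` (100% over) returns the clamp of 2000 jiffies. The absolute numbers will not match the timings quoted in the commit message, since those depend on the real kernel configuration.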
..
kasan	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
backing-dev.c	writeback: Separate out wb_get_lookup() from wb_get_create()	2019-08-27 09:22:38 -06:00
balloon_compaction.c	mm/balloon_compaction: suppress allocation warnings	2019-09-04 07:42:01 -04:00
cleancache.c	Driver Core and debugfs changes for 5.3-rc1	2019-07-12 12:24:03 -07:00
cma_debug.c	mm/cma_debug.c: fix the break condition in cma_maxchunk_get()	2019-05-14 09:47:45 -07:00
cma.c	mm/cma.c: fail if fixed declaration can't be honored	2019-07-16 19:23:21 -07:00
cma.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
compaction.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
debug_page_ref.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
debug.c	mm: update references to page _refcount	2019-05-14 19:52:47 -07:00
dmapool.c	mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options	2019-07-12 11:05:46 -07:00
early_ioremap.c	mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep	2017-12-11 14:54:44 +01:00
fadvise.c	fs: Export generic_fadvise()	2019-08-30 22:43:58 -07:00
failslab.c	mm/failslab.c: by default, do not fail allocations with direct reclaim only	2019-07-12 11:05:43 -07:00
filemap.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
frame_vector.c	mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'	2017-12-14 16:00:48 -08:00
frontswap.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482	2019-06-19 17:09:52 +02:00
gup_benchmark.c	mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM	2019-05-14 09:47:45 -07:00
gup.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
highmem.c	mm: convert totalram_pages and totalhigh_pages variables to atomic	2018-12-28 12:11:47 -08:00
hmm.c	pagewalk: separate function pointers from iterator data	2019-09-07 04:28:04 -03:00
huge_memory.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
hugetlb_cgroup.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
hugetlb.c	hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS	2019-08-13 16:06:53 -07:00
hwpoison-inject.c	hwpoison-inject: no need to check return value of debugfs_create functions	2019-06-03 15:39:40 +02:00
init-mm.c	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids	2018-07-17 09:35:30 +02:00
internal.h	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152	2019-05-30 11:26:32 -07:00
interval_tree.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248	2019-06-19 17:09:08 +02:00
Kconfig	mm: remove CONFIG_MIGRATE_VMA_HELPER	2019-08-20 09:35:03 -03:00
Kconfig.debug	mm, page_owner, debug_pagealloc: save and dump freeing stack trace	2019-09-24 15:54:08 -07:00
khugepaged.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
kmemleak-test.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333	2019-06-05 17:37:06 +02:00
kmemleak.c	mm/kmemleak.c: record the current memory pool size	2019-09-24 15:54:07 -07:00
ksm.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482	2019-06-19 17:09:52 +02:00
list_lru.c	mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages	2019-07-12 11:05:44 -07:00
maccess.c	The main changes in this release include:	2019-07-18 11:51:00 -07:00
madvise.c	hmm related patches for 5.4	2019-09-21 10:07:42 -07:00
Makefile	memremap: move from kernel/ to mm/	2019-08-03 07:02:01 -07:00
memblock.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152	2019-05-30 11:26:32 -07:00
memcontrol.c	mm, memcg: throttle allocators when failing reclaim over memory.high	2019-09-24 15:54:08 -07:00
memfd.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
memory_hotplug.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
memory-failure.c	HMM patches for 5.3	2019-07-14 19:42:11 -07:00
memory.c	vfs: don't allow writes to swap files	2019-08-20 07:55:16 -07:00
mempolicy.c	pagewalk: separate function pointers from iterator data	2019-09-07 04:28:04 -03:00
mempool.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
memremap.c	Merge branch 'odp_fixes' into hmm.git	2019-08-21 20:58:18 -03:00
memtest.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
migrate.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
mincore.c	pagewalk: separate function pointers from iterator data	2019-09-07 04:28:04 -03:00
mlock.c	mm/mlock.c: change count_mm_mlocked_page_nr return type	2019-06-13 17:34:56 -10:00
mm_init.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
mmap.c	vfs: don't allow writes to swap files	2019-08-20 07:55:16 -07:00
mmu_context.c	sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>	2017-03-02 08:42:38 +01:00
mmu_gather.c	mm: mmu_gather: remove __tlb_reset_range() for force flush	2019-06-13 17:34:56 -10:00
mmu_notifier.c	mm, notifier: Catch sleeping/blocking for !blockable	2019-09-07 04:28:05 -03:00
mmzone.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mprotect.c	pagewalk: separate function pointers from iterator data	2019-09-07 04:28:04 -03:00
mremap.c	mm/mmu_notifier: contextual information for event triggering invalidation	2019-05-14 09:47:49 -07:00
msync.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
nommu.c	mm: introduce page_size()	2019-09-24 15:54:08 -07:00
oom_kill.c	mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()	2019-07-12 11:05:47 -07:00
page_alloc.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
page_counter.c	memcg: introduce memory.min	2018-06-07 17:34:36 -07:00
page_ext.c	mm, debug_pagealloc: use a page type instead of page_ext flag	2019-07-12 11:05:43 -07:00
page_idle.c	mm/page_idle.c: fix oops because end_pfn is larger than max_pfn	2019-06-29 16:43:45 +08:00
page_io.c	mm, swap: use rbtree for swap_extent	2019-07-12 11:05:43 -07:00
page_isolation.c	mm/page_isolation.c: change the prototype of undo_isolate_page_range()	2019-07-12 11:05:43 -07:00
page_owner.c	mm, page_owner, debug_pagealloc: save and dump freeing stack trace	2019-09-24 15:54:08 -07:00
page_poison.c	mm/page_poison.c: fix a typo in a comment	2019-09-24 15:54:08 -07:00
page_vma_mapped.c	mm: introduce page_size()	2019-09-24 15:54:08 -07:00
page-writeback.c	writeback, memcg: Implement foreign dirty flushing	2019-08-27 09:22:38 -06:00
pagewalk.c	pagewalk: use lockdep_assert_held for locking validation	2019-09-07 04:28:04 -03:00
percpu-internal.h	percpu: convert chunk hints to be based on pcpu_block_md	2019-03-13 12:25:31 -07:00
percpu-km.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428	2019-06-05 17:37:16 +02:00
percpu-stats.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428	2019-06-05 17:37:16 +02:00
percpu-vm.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428	2019-06-05 17:37:16 +02:00
percpu.c	percpu: Use struct_size() helper	2019-09-04 13:40:49 -07:00
pgtable-generic.c	x86/mm: Page size aware flush_tlb_mm_range()	2018-10-09 16:51:11 +02:00
process_vm_access.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152	2019-05-30 11:26:32 -07:00
quicklist.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
readahead.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
rmap.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
rodata_test.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441	2019-06-05 17:37:17 +02:00
shmem.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
shuffle.c	mm: maintain randomization of page free lists	2019-05-14 19:52:48 -07:00
shuffle.h	mm: maintain randomization of page free lists	2019-05-14 19:52:48 -07:00
slab_common.c	mm, slab: extend slab/shrink to shrink all memcg caches	2019-09-24 15:54:07 -07:00
slab.c	mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options	2019-07-12 11:05:46 -07:00
slab.h	mm, slab: move memcg_cache_params structure to mm/slab.h	2019-09-24 15:54:07 -07:00
slob.c	mm: introduce page_size()	2019-09-24 15:54:08 -07:00
slub.c	mm: introduce page_size()	2019-09-24 15:54:08 -07:00
sparse-vmemmap.c	mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()	2019-07-18 17:08:07 -07:00
sparse.c	mm/sparsemem: cleanup 'section number' data types	2019-07-18 17:08:07 -07:00
swap_cgroup.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
swap_slots.c	mm, swap, get_swap_pages: use entry_size instead of cluster in parameter	2018-08-22 10:52:44 -07:00
swap_state.c	mm: page cache: store only head pages in i_pages	2019-09-24 15:54:08 -07:00
swap.c	mm: replace list_move_tail() with add_page_to_lru_list_tail()	2019-09-24 15:54:08 -07:00
swapfile.c	vfs: don't allow writes to swap files	2019-08-20 07:55:16 -07:00
truncate.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
usercopy.c	mm/usercopy: use memory range to be accessed for wraparound check	2019-08-13 16:06:52 -07:00
userfaultfd.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499	2019-06-19 17:09:53 +02:00
util.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
vmacache.c	mm: get rid of vmacache_flush_all() entirely	2018-09-13 15:18:04 -10:00
vmalloc.c	vmalloc: lift the arm flag for coherent mappings to common code	2019-09-04 11:13:19 +02:00
vmpressure.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500	2019-06-19 17:09:55 +02:00
vmscan.c	mm: introduce compound_nr()	2019-09-24 15:54:08 -07:00
vmstat.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
workingset.c	mm: workingset: fix vmstat counters for shadow nodes	2019-08-13 16:06:52 -07:00
z3fold.c	z3fold: fix retry mechanism in page reclaim	2019-09-24 15:54:06 -07:00
zbud.c	treewide: Add SPDX license identifier for more missed files	2019-05-21 10:50:45 +02:00
zpool.c	treewide: Add SPDX license identifier for more missed files	2019-05-21 10:50:45 +02:00
zsmalloc.c	mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n	2019-08-30 18:00:50 -07:00
zswap.c	zswap: ignore debugfs_create_dir() return value	2019-06-03 15:39:39 +02:00