linux_dsm_epyc7002/mm
Johannes Weiner 8a931f8013 mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says.  The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.

Consider memory in a data center host.  At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing.  Both branches are further subdivided into individual
services, job components etc.

We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other.  Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree.  Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general.  This
results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level.  Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads.  It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g.  rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.

It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.

To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now.  However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size.  Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree.  But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.

E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection.
Consider the following example tree:

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree.  Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior.  Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:28 -07:00
..
kasan RISC-V Patches for the 5.6 Merge Window, Part 1 2020-01-31 11:23:29 -08:00
backing-dev.c memcg: fix a crash in wb_workfn when a device disappears 2020-01-31 10:30:36 -08:00
balloon_compaction.c mm/balloon_compaction: suppress allocation warnings 2019-09-04 07:42:01 -04:00
cleancache.c Driver Core and debugfs changes for 5.3-rc1 2019-07-12 12:24:03 -07:00
cma_debug.c mm/cma_debug.c: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
cma.c mm/cma.c: switch to bitmap_zalloc() for cma bitmap allocation 2019-12-01 12:59:09 -08:00
cma.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compaction.c mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() 2019-10-14 15:04:01 -07:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
debug.c mm: dump_page(): additional diagnostics for huge pinned pages 2020-04-02 09:35:27 -07:00
dmapool.c mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options 2019-07-12 11:05:46 -07:00
early_ioremap.c mm/early_ioremap.c: use %pa to print resource_size_t variables 2020-01-31 10:30:38 -08:00
fadvise.c fs: Export generic_fadvise() 2019-08-30 22:43:58 -07:00
failslab.c mm/failslab.c: by default, do not fail allocations with direct reclaim only 2019-07-12 11:05:43 -07:00
filemap.c mm/filemap.c: rewrite pagecache_get_page documentation 2020-04-02 09:35:27 -07:00
frame_vector.c mm: untag user pointers in get_vaddr_frames 2019-09-25 17:51:41 -07:00
frontswap.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
gup_benchmark.c mm/gup_benchmark: support pin_user_pages() and related calls 2020-04-02 09:35:27 -07:00
gup.c mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path 2020-04-02 09:35:27 -07:00
highmem.c mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions 2019-12-10 10:12:55 +01:00
hmm.c mm: pagewalk: add 'depth' parameter to pte_hole 2020-02-04 03:05:25 +00:00
huge_memory.c mm/gup: track FOLL_PIN pages 2020-04-02 09:35:27 -07:00
hugetlb_cgroup.c hugetlb_cgroup: fix illegal access to memory 2020-03-29 09:47:05 -07:00
hugetlb.c mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages 2020-04-02 09:35:27 -07:00
hwpoison-inject.c mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
init-mm.c mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch 2019-10-19 06:32:32 -04:00
internal.h mm: swap: make page_evictable() inline 2020-04-02 09:35:27 -07:00
interval_tree.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248 2019-06-19 17:09:08 +02:00
Kconfig mm/Kconfig: fix trivial help text punctuation 2019-12-01 12:59:10 -08:00
Kconfig.debug mm: add generic ptdump 2020-02-04 03:05:25 +00:00
khugepaged.c mm/thp: flush file for !is_shmem PageDirty() case in collapse_file() 2019-12-01 12:59:09 -08:00
kmemleak-test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333 2019-06-05 17:37:06 +02:00
kmemleak.c mm/kmemleak.c: use address-of operator on section symbols 2020-04-02 09:35:26 -07:00
ksm.c * PPC secure guest support 2019-12-04 11:08:30 -08:00
list_lru.c mm: memcg/slab: use mem_cgroup_from_obj() 2020-04-02 09:35:28 -07:00
maccess.c uaccess: Add strict non-pagefault kernel-space read function 2019-11-02 12:39:12 -07:00
madvise.c mm: do not allow MADV_PAGEOUT for CoW pages 2020-03-21 18:56:06 -07:00
Makefile mm/Makefile: disable KCSAN for kmemleak 2020-04-02 09:35:26 -07:00
mapping_dirty_helpers.c mm: Add write-protect and clean utilities for address space ranges 2019-11-06 13:03:36 +01:00
memblock.c memblock: Use __func__ in remaining memblock_dbg() call sites 2020-01-31 10:30:38 -08:00
memcontrol.c mm: memcontrol: recursive memory.low protection 2020-04-02 09:35:28 -07:00
memfd.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
memory_hotplug.c mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled 2020-03-06 07:06:09 -06:00
memory-failure.c mm/memory-failure.c: use page_shift() in add_to_kill() 2019-12-01 12:59:04 -08:00
memory.c mm: avoid data corruption on CoW fault into PFN-mapped VMA 2020-03-06 07:06:09 -06:00
mempolicy.c mm/mempolicy.c: fix out of bounds write in mpol_parse_str() 2020-01-31 10:30:36 -08:00
mempool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
memremap.c mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone() 2020-02-04 03:05:23 +00:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c mm: pagewalk: add 'depth' parameter to pte_hole 2020-02-04 03:05:25 +00:00
mincore.c mm: pagewalk: add 'depth' parameter to pte_hole 2020-02-04 03:05:25 +00:00
mlock.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mm_init.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
mmap.c mm: Avoid creating virtual address aliases in brk()/mmap()/mremap() 2020-02-20 10:03:14 +00:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_gather.c asm-generic/tlb: provide MMU_GATHER_TABLE_FREE 2020-02-04 03:05:26 +00:00
mmu_notifier.c mm/mmu_notifier: silence PROVE_RCU_LIST warnings 2020-03-21 18:56:06 -07:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa 2020-03-06 07:06:09 -06:00
mremap.c mm/mremap: Add comment explaining the untagging behaviour of mremap() 2020-03-26 11:28:57 +00:00
msync.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
nommu.c x86/mm: split vmalloc_sync_all() 2020-03-21 18:56:06 -07:00
oom_kill.c mm, oom: dump stack of victim when reaping failed 2020-01-31 10:30:38 -08:00
page_alloc.c mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() 2020-04-02 09:35:28 -07:00
page_counter.c mm: memcontrol: fix memory.low proportional distribution 2020-04-02 09:35:28 -07:00
page_ext.c mm, page_owner: fix off-by-one error in __set_page_owner_handle() 2019-10-14 15:04:00 -07:00
page_idle.c mm/page_idle.c: fix oops because end_pfn is larger than max_pfn 2019-06-29 16:43:45 +08:00
page_io.c fs: Enable bmap() function to properly return errors 2020-02-03 08:05:37 -05:00
page_isolation.c mm/page_isolation: fix potential warning from user 2020-01-31 10:30:39 -08:00
page_owner.c mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo 2019-10-19 06:32:31 -04:00
page_poison.c mm/page_poison.c: fix a typo in a comment 2019-09-24 15:54:08 -07:00
page_vma_mapped.c mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page 2020-01-31 10:30:38 -08:00
page-writeback.c mm/gup/writeback: add callbacks for inaccessible pages 2020-04-02 09:35:27 -07:00
pagewalk.c x86: mm: avoid allocating struct mm_struct on the stack 2020-02-04 03:05:25 +00:00
percpu-internal.h percpu: convert chunk hints to be based on pcpu_block_md 2019-03-13 12:25:31 -07:00
percpu-km.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-stats.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-vm.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu.c bitmap: genericize percpu bitmap region iterators 2020-01-20 16:40:56 +01:00
pgtable-generic.c asm-generic/mm: stub out p{4,u}d_clear_bad() if __PAGETABLE_P{4,U}D_FOLDED 2019-12-01 06:29:19 -08:00
process_vm_access.c mm, tree-wide: rename put_user_page*() to unpin_user_page*() 2020-01-31 10:30:38 -08:00
ptdump.c x86: mm: avoid allocating struct mm_struct on the stack 2020-02-04 03:05:25 +00:00
readahead.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
rmap.c mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages 2020-04-02 09:35:27 -07:00
rodata_test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 2019-06-05 17:37:17 +02:00
shmem.c tmpfs: deny and force are not huge mount options 2020-02-18 15:07:30 -05:00
shuffle.c mm: fix -Wmissing-prototypes warnings 2019-10-07 15:47:19 -07:00
shuffle.h mm: maintain randomization of page free lists 2019-05-14 19:52:48 -07:00
slab_common.c mm, memcg: fix build error around the usage of kmem_caches 2020-04-02 09:35:28 -07:00
slab.c mm, debug_pagealloc: don't rely on static keys too early 2020-01-13 18:19:02 -08:00
slab.h mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge() 2020-04-02 09:35:28 -07:00
slob.c mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) 2019-10-07 15:47:20 -07:00
slub.c slub: relocate freelist pointer to middle of object 2020-04-02 09:35:26 -07:00
sparse-vmemmap.c mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() 2019-07-18 17:08:07 -07:00
sparse.c mm/sparse: fix kernel crash with pfn_section_valid check 2020-03-29 09:47:06 -07:00
swap_cgroup.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
swap_slots.c mm/swap_slots.c: assign|reset cache slot by value directly 2020-04-02 09:35:27 -07:00
swap_state.c mm/swap_state.c: use the same way to count page in [add_to|delete_from]_swap_cache 2020-04-02 09:35:28 -07:00
swap.c mm: swap: use smp_mb__after_atomic() to order LRU bit set 2020-04-02 09:35:28 -07:00
swapfile.c mm/swapfile: fix data races in try_to_unuse() 2020-04-02 09:35:27 -07:00
truncate.c mm/thp: allow dropping THP from page cache 2019-10-19 06:32:33 -04:00
usercopy.c usercopy: Avoid HIGHMEM pfn warning 2019-09-17 15:20:17 -07:00
userfaultfd.c mm: fix typos in comments when calling __SetPageUptodate() 2019-12-01 12:59:10 -08:00
util.c mm/mmap.c: rb_parent is not necessary in __vma_link_list() 2019-12-01 06:29:19 -08:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c x86/mm: split vmalloc_sync_all() 2020-03-21 18:56:06 -07:00
vmpressure.c mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() 2019-10-07 15:47:19 -07:00
vmscan.c mm: swap: make page_evictable() inline 2020-04-02 09:35:27 -07:00
vmstat.c mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting 2020-04-02 09:35:27 -07:00
workingset.c mm: vmscan: detect file thrashing at the reclaim root 2019-12-01 12:59:07 -08:00
z3fold.c mm/z3fold.c: do not include rwlock.h directly 2020-03-06 07:06:09 -06:00
zbud.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
zpool.c zpool: add malloc_support_movable to zpool_driver 2019-09-24 15:54:12 -07:00
zsmalloc.c mm/zsmalloc.c: fix the migrated zspage statistics. 2020-01-04 13:55:09 -08:00
zswap.c zswap: potential NULL dereference on error in init_zswap() 2020-01-31 10:30:39 -08:00