linux_dsm_epyc7002/mm
Yafang Shao 9066e5cfb7 mm, oom: make the calculation of oom badness more accurate
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1.  Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value.  As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate.  We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
..
kasan Kbuild updates for v5.9 2020-08-09 14:10:26 -07:00
backing-dev.c writeback: remove struct bdi_writeback_congested 2020-07-08 17:05:53 -06:00
balloon_compaction.c mm/balloon_compaction: suppress allocation warnings 2019-09-04 07:42:01 -04:00
cleancache.c Driver Core and debugfs changes for 5.3-rc1 2019-07-12 12:24:03 -07:00
cma_debug.c debugfs: make sure we can remove u32_array files cleanly 2020-07-10 13:54:00 -07:00
cma.c mm/cma.c: use exact_nid true to fix possible per-numa cma leak 2020-07-03 16:15:25 -07:00
cma.h debugfs: make sure we can remove u32_array files cleanly 2020-07-10 13:54:00 -07:00
compaction.c mm/compaction: correct the comments of compact_defer_shift 2020-08-12 10:57:56 -07:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
debug_vm_pgtable.c Documentation/mm: add descriptions for arch page table helpers 2020-08-07 11:33:23 -07:00
debug.c mm, dump_page: do not crash with bad compound_mapcount() 2020-08-07 11:33:23 -07:00
dmapool.c mm/dmapool.c: micro-optimisation remove unnecessary branch 2020-04-07 10:43:42 -07:00
early_ioremap.c mm/early_ioremap.c: use %pa to print resource_size_t variables 2020-01-31 10:30:38 -08:00
fadvise.c mm: return void from various readahead functions 2020-06-02 10:59:06 -07:00
failslab.c mm/failslab.c: by default, do not fail allocations with direct reclaim only 2019-07-12 11:05:43 -07:00
filemap.c mm: filemap: add missing FGP_ flags in kerneldoc comment for pagecache_get_page 2020-08-07 11:33:23 -07:00
frame_vector.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
frontswap.c treewide: Remove uninitialized_var() usage 2020-07-16 12:35:15 -07:00
gup_benchmark.c mm/gup_benchmark: support pin_user_pages() and related calls 2020-04-02 09:35:27 -07:00
gup.c mm/gup.c: fix the comment of return value for populate_vma_page_range() 2020-08-07 11:33:23 -07:00
highmem.c mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions 2019-12-10 10:12:55 +01:00
hmm.c mm/hmm: provide the page mapping order in hmm_range_fault() 2020-07-10 16:24:28 -03:00
huge_memory.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
hugetlb_cgroup.c mm: use fallthrough; 2020-04-07 10:43:41 -07:00
hugetlb.c mm/hugetlb: add mempolicy check in the reservation routine 2020-08-12 10:57:55 -07:00
hwpoison-inject.c mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
init-mm.c mmap locking API: add MMAP_LOCK_INITIALIZER 2020-06-09 09:39:14 -07:00
internal.h mm: proactive compaction 2020-08-12 10:57:56 -07:00
interval_tree.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248 2019-06-19 17:09:08 +02:00
ioremap.c mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
Kconfig mm/sparse: cleanup the code surrounding memory_present() 2020-08-07 11:33:27 -07:00
Kconfig.debug treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
khugepaged.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
kmemleak-test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333 2019-06-05 17:37:06 +02:00
kmemleak.c mm/kmemleak.c: use address-of operator on section symbols 2020-04-02 09:35:26 -07:00
ksm.c powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
list_lru.c mm/list_lru.c: Rename kvfree_rcu() to local variant 2020-06-29 11:59:25 -07:00
maccess.c maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault 2020-06-17 10:57:41 -07:00
madvise.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
Makefile mm: move lib/ioremap.c to mm/ 2020-08-07 11:33:26 -07:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: update huge page-table entry callbacks 2020-04-02 09:35:29 -07:00
memblock.c mm/memblock: expose only miminal interface to add/walk physmem 2020-07-10 15:08:09 +02:00
memcontrol.c mm/workingset: prepare the workingset detection infrastructure for anon LRU 2020-08-12 10:57:55 -07:00
memfd.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
memory_hotplug.c mm/memory_hotplug: document why shuffle_zone() is relevant 2020-08-07 11:33:29 -07:00
memory-failure.c mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread 2020-06-11 18:17:47 -07:00
memory.c mm/swap: implement workingset detection for anonymous LRU 2020-08-12 10:57:56 -07:00
mempolicy.c mm/mempolicy.c: check parameters first in kernel_get_mempolicy 2020-08-12 10:57:56 -07:00
mempool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
memremap.c mm/memremap: set caching mode for PCI P2PDMA memory to WC 2020-04-10 15:36:21 -07:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
mincore.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
mlock.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mm_init.c mm: adjust vm_committed_as_batch according to vm overcommit policy 2020-08-07 11:33:26 -07:00
mmap.c mm: remove unnecessary wrapper function do_mmap_pgoff() 2020-08-07 11:33:27 -07:00
mmu_gather.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mmu_notifier.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mremap.c mm/mremap: start addresses are properly aligned 2020-08-07 11:33:27 -07:00
msync.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
nommu.c mm: remove unnecessary wrapper function do_mmap_pgoff() 2020-08-07 11:33:27 -07:00
oom_kill.c mm, oom: make the calculation of oom badness more accurate 2020-08-12 10:57:56 -07:00
page_alloc.c mm/page_alloc: fix memalloc_nocma_{save/restore} APIs 2020-08-07 11:33:29 -07:00
page_counter.c mm/page_counter.c: fix protection usage propagation 2020-08-07 11:33:26 -07:00
page_ext.c mm/page_ext.c: drop pfn_present() check when onlining 2020-04-07 10:43:40 -07:00
page_idle.c mm/page_idle.c: skip offline pages 2020-06-08 11:05:55 -07:00
page_io.c mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io 2020-08-07 11:33:24 -07:00
page_isolation.c mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE 2020-06-04 15:36:52 -04:00
page_owner.c mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention 2020-06-03 20:09:45 -07:00
page_poison.c mm/page_poison.c: fix a typo in a comment 2019-09-24 15:54:08 -07:00
page_reporting.c mm/page_reporting: add budget limit on how many pages can be reported per pass 2020-04-07 10:43:39 -07:00
page_reporting.h mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
page_vma_mapped.c mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page 2020-01-31 10:30:38 -08:00
page-writeback.c mm: remove vm_total_pages 2020-08-07 11:33:28 -07:00
pagewalk.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
percpu-internal.h mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-km.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-stats.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-vm.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu.c mm: memcg/percpu: per-memcg percpu memory statistics 2020-08-12 10:57:55 -07:00
pgalloc-track.h mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
pgtable-generic.c mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
process_vm_access.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
ptdump.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
readahead.c mm: use memalloc_nofs_save in readahead path 2020-06-02 10:59:07 -07:00
rmap.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
rodata_test.c maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
shmem.c mm/swapcache: support to handle the shadow entries 2020-08-12 10:57:55 -07:00
shuffle.c mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
shuffle.h mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
slab_common.c mm: memcg/slab: use a single set of kmem_caches for all allocations 2020-08-07 11:33:25 -07:00
slab.c mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() 2020-08-07 11:33:25 -07:00
slab.h mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() 2020-08-07 11:33:25 -07:00
slob.c mm: memcg: convert vmstat slab counters to bytes 2020-08-07 11:33:24 -07:00
slub.c mm: slab: rename (un)charge_slab_page() to (un)account_slab_page() 2020-08-07 11:33:25 -07:00
sparse-vmemmap.c mm/sparse: only sub-section aligned range would be populated 2020-08-07 11:33:27 -07:00
sparse.c mm/sparse: cleanup the code surrounding memory_present() 2020-08-07 11:33:27 -07:00
swap_cgroup.c mm: memcontrol: make swap tracking an integral part of memory control 2020-06-03 20:09:48 -07:00
swap_slots.c mm/swap_slots.c: remove redundant check for swap_slot_cache_initialized 2020-08-07 11:33:24 -07:00
swap_state.c mm/swap: implement workingset detection for anonymous LRU 2020-08-12 10:57:56 -07:00
swap.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
swapfile.c mm/swapcache: support to handle the shadow entries 2020-08-12 10:57:55 -07:00
truncate.c mm/thp: allow dropping THP from page cache 2019-10-19 06:32:33 -04:00
usercopy.c usercopy: Avoid HIGHMEM pfn warning 2019-09-17 15:20:17 -07:00
userfaultfd.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
util.c mm: remove unnecessary wrapper function do_mmap_pgoff() 2020-08-07 11:33:27 -07:00
vmacache.c kernel: better document the use_mm/unuse_mm API contract 2020-06-10 19:14:18 -07:00
vmalloc.c mm/vmalloc.c: remove BUG() from the find_va_links() 2020-08-07 11:33:28 -07:00
vmpressure.c mm: vmpressure: use mem_cgroup_is_root API 2020-04-02 09:35:31 -07:00
vmscan.c mm/vmscan: restore active/inactive ratio for anonymous LRU 2020-08-12 10:57:56 -07:00
vmstat.c mm: use unsigned types for fragmentation score 2020-08-12 10:57:56 -07:00
workingset.c mm/swap: implement workingset detection for anonymous LRU 2020-08-12 10:57:56 -07:00
z3fold.c mm/z3fold: silence kmemleak false positives of slots 2020-05-28 11:35:40 -07:00
zbud.c mm: use false for bool variable 2020-06-04 19:06:24 -07:00
zpool.c zpool: add malloc_support_movable to zpool_driver 2019-09-24 15:54:12 -07:00
zsmalloc.c mm: reorder includes after introduction of linux/pgtable.h 2020-06-09 09:39:13 -07:00
zswap.c mm/zswap: allow setting default status, compressor and allocator in Kconfig 2020-04-07 10:43:41 -07:00