linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-24 14:51:00 +07:00

Johannes Weiner a756cf5908 mm: try to distribute dirty pages fairly across zones		2012-01-10 16:30:43 -08:00

The maximum number of dirty pages that exist in the system at any time is determined by a number of pages considered dirtyable and a user-configured percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the zones in the system minus their lowmem reserves and high watermarks, so that the system can retain a healthy number of free pages without having to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not care about the global state but rather the state of individual memory zones. And right now there is nothing that prevents one zone from filling up with dirty pages while other zones are spared, which frequently leads to situations where kswapd, in order to restore the watermark of free pages, does indeed have to write pages from that zone's LRU list. This can interfere so badly with IO from the flusher threads that major filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim already, taking away the VM's only possibility to keep such a zone balanced, aside from hoping the flushers will soon clean pages from that zone.

Enter per-zone dirty limits. They are to a zone's dirtyable memory what the global limit is to the global amount of dirtyable memory, and try to make sure that no single zone receives more than its fair share of the globally allowed dirty pages in the first place. As the number of pages considered dirtyable excludes the zones' lowmem reserves and high watermarks, the maximum number of dirty pages in a zone is such that the zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are dirtied only after the allocation, this patch allows allocators to pass __GFP_WRITE when they know in advance that the page will be written to and become dirty soon. The page allocator will then attempt to allocate from the first zone of the zonelist - which on NUMA is determined by the task's NUMA memory policy - that has not exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can increase pressure on them, but this is not the case. With a full high zone, allocations will be diverted to lower zones eventually, so it is more of a shift in timing of the lower zone allocations. Workloads that previously could fit their dirty pages completely in the higher zone may be forced to allocate from lower zones, but the amount of pages that "spill over" are limited themselves by the lower zones' dirty constraints, and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA configurations where the zones allowed for allocation are in sum not big enough to trigger the global dirty limits, wake up the flusher threads and remedy the situation. Because of this, an allocation that could not succeed on any of the considered zones is allowed to ignore the dirty limits before going into direct reclaim or even failing the allocation, until a future patch changes the global dirty throttling and flusher thread activation so that they take individual zone states into account.

Test results

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

          seconds           nr_vmscan_write
                  (stddev)          min|     median|        max
xfs
vanilla:   549.747( 3.492)        0.000|      0.000|      0.000
patched:   550.996( 3.802)        0.000|      0.000|      0.000

fuse-ntfs
vanilla:  1183.094(53.178)    54349.000|  59341.000|  65163.000
patched:   558.049(17.914)        0.000|      0.000|     43.000

btrfs
vanilla:   573.679(14.015)   156657.000| 460178.000| 606926.000
patched:   563.365(11.368)        0.000|      0.000|   1362.000

ext4
vanilla:   561.197(15.782)        0.000|2725438.000|4143837.000
patched:   568.806(17.496)        0.000|      0.000|      0.000

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
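
The placement logic described in the commit message can be illustrated with a small user-space sketch. This is not the kernel code: `struct zone_sketch`, `zone_dirtyable()`, `zone_dirty_limit()`, `zone_dirty_ok()` and `pick_zone()` are hypothetical names chosen for illustration, the dirtyable-memory formula is simplified to "zone pages minus lowmem reserve and high watermark", and a boolean parameter stands in for `__GFP_WRITE`. It only aims to show the two ideas from the message: a per-zone limit derived from the zone's dirtyable memory, and a fallback pass that ignores the limits when no zone qualifies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory zone (not the kernel's struct zone). */
struct zone_sketch {
    const char *name;
    unsigned long pages;          /* pages provided by the zone */
    unsigned long lowmem_reserve; /* pages held back for lower-zone users */
    unsigned long high_wmark;     /* high watermark of free pages */
    unsigned long free_pages;     /* pages currently free */
    unsigned long nr_dirty;       /* pages currently dirty */
};

/* Dirtyable memory: zone pages minus lowmem reserve and high watermark. */
static unsigned long zone_dirtyable(const struct zone_sketch *z)
{
    unsigned long excluded = z->lowmem_reserve + z->high_wmark;

    return z->pages > excluded ? z->pages - excluded : 0;
}

/* Per-zone limit: the configured percentage applied to the zone's dirtyable pages. */
static unsigned long zone_dirty_limit(const struct zone_sketch *z,
                                      unsigned int dirty_ratio)
{
    return zone_dirtyable(z) * dirty_ratio / 100;
}

/* True while the zone has not exceeded its dirty limit. */
static bool zone_dirty_ok(const struct zone_sketch *z, unsigned int dirty_ratio)
{
    return z->nr_dirty <= zone_dirty_limit(z, dirty_ratio);
}

/*
 * Walk the zonelist in preference order. For allocations that will soon be
 * dirtied (will_write stands in for __GFP_WRITE), skip zones over their
 * dirty limit. If no zone qualifies, retry while ignoring the dirty limits,
 * mirroring the fallback the commit message describes.
 */
static struct zone_sketch *pick_zone(struct zone_sketch **zonelist, size_t n,
                                     unsigned int dirty_ratio, bool will_write)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct zone_sketch *z = zonelist[i];

        if (z->free_pages == 0)
            continue;
        if (will_write && !zone_dirty_ok(z, dirty_ratio))
            continue;
        return z;
    }

    if (will_write) {
        /* Second pass: dirty limits are ignored. */
        for (i = 0; i < n; i++)
            if (zonelist[i]->free_pages > 0)
                return zonelist[i];
    }

    return NULL; /* the caller would have to reclaim or fail */
}
```

With a 40% ratio, a zone of 1000 pages and a high watermark of 100 has 900 dirtyable pages and a limit of 360 dirty pages. Once it holds 400 dirty pages, a write-bound allocation is diverted to the next zone in the list, while an allocation without the write hint still uses the first zone.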
..
backing-dev.c	freezer: implement and use kthread_freezable_should_stop()	2011-11-21 12:32:23 -08:00
bootmem.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
bounce.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
cleancache.c	mm: cleancache core ops functions and config	2011-05-26 10:01:36 -06:00
compaction.c	convert 'memory' sysdev_class to a regular subsystem	2011-12-21 14:48:43 -08:00
debug-pagealloc.c	debug-pagealloc: add support for highmem pages	2011-10-31 17:30:48 -07:00
dmapool.c	mm: fix implicit stat.h usage in dmapool.c	2011-10-31 09:20:12 -04:00
fadvise.c	fadvise: only initiate writeback for specified range with FADV_DONTNEED	2012-01-10 16:30:43 -08:00
failslab.c	switch debugfs to umode_t	2012-01-03 22:54:56 -05:00
filemap_xip.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
filemap.c	should_remove_suid(): inode->i_mode is umode_t	2012-01-03 22:55:14 -05:00
fremap.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
highmem.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
huge_memory.c	thp: reduce khugepaged freezing latency	2011-12-09 07:50:28 -08:00
hugetlb.c	mm/hugetlb.c: fix virtual address handling in hugetlb fault	2012-01-10 16:30:42 -08:00
hwpoison-inject.c	Fix common misspellings	2011-03-31 11:26:23 -03:00
init-mm.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
internal.h	mm: thp: tail page refcounting fix	2011-11-02 16:06:57 -07:00
Kconfig	Merge branch 'master' into x86/memblock	2011-11-28 09:46:22 -08:00
Kconfig.debug	mm: more intensive memory corruption debugging	2012-01-10 16:30:42 -08:00
kmemcheck.c	kmemcheck: Fix build errors due to missing slab.h	2010-03-30 22:02:32 +09:00
kmemleak-test.c	kmemleak: remove memset by using kzalloc	2011-01-27 18:31:51 +00:00
kmemleak.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
ksm.c	oom: fix race while temporarily setting current's oom_score_adj	2011-10-31 17:30:45 -07:00
maccess.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
madvise.c	fs: kill i_alloc_sem	2011-07-20 20:47:46 -04:00
Makefile	Cross Memory Attach	2011-10-31 17:30:44 -07:00
memblock.c	memblock: Reimplement memblock allocation using reverse free area iterator	2011-12-08 10:22:09 -08:00
memcontrol.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-01-09 14:46:52 -08:00
memory_hotplug.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
memory-failure.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
memory.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
mempolicy.c	mm/mempolicy.c: refix mbind_range() vma issue	2011-12-29 16:31:57 -08:00
mempool.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
migrate.c	mm: migrate: one less atomic operation	2012-01-10 16:30:41 -08:00
mincore.c	mm: clarify the radix_tree exceptional cases	2011-08-03 14:25:24 -10:00
mlock.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
mm_init.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
mmap.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
mmu_context.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
mmu_notifier.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
mmzone.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
mprotect.c	thp: mprotect: transparent huge page support	2011-01-13 17:32:44 -08:00
mremap.c	thp: mremap support and TLB optimization	2011-10-31 17:30:48 -07:00
msync.c	sanitize vfs_fsync calling conventions	2010-05-21 18:31:21 -04:00
nobootmem.c	Merge branch 'master' into x86/memblock	2011-11-28 09:46:22 -08:00
nommu.c	xen: map foreign pages for shared rings by updating the PTEs directly	2011-11-16 12:13:08 -05:00
oom_kill.c	Merge branch 'master' into pm-sleep	2011-12-21 21:59:45 +01:00
page_alloc.c	mm: try to distribute dirty pages fairly across zones	2012-01-10 16:30:43 -08:00
page_cgroup.c	mm/page_cgroup.c: quiet sparse noise	2011-11-02 16:07:00 -07:00
page_io.c	block: kill off REQ_UNPLUG	2011-03-10 08:52:27 +01:00
page_isolation.c	mm: page_isolation: codeclean fix comment and rm unneeded val init	2010-10-26 16:52:11 -07:00
page-writeback.c	mm: try to distribute dirty pages fairly across zones	2012-01-10 16:30:43 -08:00
pagewalk.c	pagewalk: fix code comment for THP	2011-07-25 20:57:09 -07:00
percpu-km.c	percpu: clear memory allocated with the km allocator	2010-10-02 10:28:42 +03:00
percpu-vm.c	percpu: fix chunk range calculation	2011-11-22 08:09:46 -08:00
percpu.c	percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses	2011-12-15 11:41:40 -08:00
pgtable-generic.c	mm/pgtable-generic.c: fix CONFIG_SWAP=n build	2011-01-26 10:49:58 +10:00
prio_tree.c	sanitize <linux/prefetch.h> usage	2011-05-20 12:50:29 -07:00
process_vm_access.c	Cross Memory Attach	2011-10-31 17:30:44 -07:00
quicklist.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
readahead.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
rmap.c	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux	2011-11-06 19:44:47 -08:00
shmem.c	vfs: switch ->show_options() to struct dentry *	2012-01-06 23:19:54 -05:00
slab.c	tracing/mm: Move include of trace/events/kmem.h out of header into slab.c	2012-01-09 14:19:33 -08:00
slob.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
slub.c	slub: min order when debug_guardpage_minorder > 0	2012-01-10 16:30:43 -08:00
sparse-vmemmap.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
sparse.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
swap_state.c	fs: move code out of buffer.c	2012-01-03 22:54:07 -05:00
swap.c	mm: add free_hot_cold_page_list() helper	2012-01-10 16:30:41 -08:00
swapfile.c	mm: avoid livelock on !__GFP_FS allocations	2012-01-10 16:30:42 -08:00
thrash.c	mm/thrash.c: quiet sparse noise	2011-10-31 17:30:50 -07:00
truncate.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
util.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
vmalloc.c	Merge branch 'devel-stable' into for-linus	2012-01-05 13:24:33 +00:00
vmscan.c	vmscan: add task name to warn_scan_unevictable() messages	2012-01-10 16:30:43 -08:00
vmstat.c	mm/vmstat.c: cache align vm_stat	2011-10-31 17:30:51 -07:00