linux_dsm_epyc7002/mm
Mel Gorman 67d46b296a mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
Rob van der Heij reported the following (paraphrased) on private mail.

	The scenario is that I want to avoid backups to fill up the page
	cache and purge stuff that is more likely to be used again (this is
	with s390x Linux on z/VM, so I don't give it as much memory that
	we don't care anymore). So I have something with LD_PRELOAD that
	intercepts the close() call (from tar, in this case) and issues
	a posix_fadvise() just before closing the file.

	This mostly works, except for small files (less than 14 pages)
	that remains in page cache after the face.

Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.

The issue is the per-cpu pagevecs for LRU additions.  If the pages are
added by one CPU but fadvise() is called on another then the pages
remain resident as the invalidate_mapping_pages() only drains the local
pagevecs via its call to pagevec_release().  The user-visible effect is
that a program that uses fadvise() properly is not obeyed.

A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a
pagevec page could not be discarded.  The downside with this is that an
inode cache shrink would send a global IPI and memory pressure
potentially causing global IPI storms is very undesirable.

Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages.
If a subset of pages are discarded it drains the LRU pagevecs and tries
again.  If the second attempt fails, it assumes it is due to the pages
being mapped, locked or dirty and does not care.  With this patch, an
application using fadvise() correctly will be obeyed but there is a
downside that a malicious application can force the kernel to send
global IPIs and increase overhead.

If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.

The following test program demonstrates the problem.  It should never
report that pages are still resident but will without this patch.  It
assumes that CPU 0 and 1 exist.

int main() {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare a buffer for writing */
	expected = FILESIZE_PAGES * pagesize;
	buf = malloc(expected + 1);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}
	buf[expected] = 0;
	memset(buf, 'a', expected);

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");
	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == NULL) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	munmap(buf, expected);
	close(fd);
	free(vec);
	exit(EXIT_SUCCESS);
}

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Tested-by: Rob van der Heij <rvdheij@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
..
backing-dev.c bdi: allow block devices to say that they require stable page writes 2013-02-21 17:22:19 -08:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
bounce.c block: optionally snapshot page contents to provide stable pages during write 2013-02-21 17:22:20 -08:00
cleancache.c ->encode_fh() API change 2012-05-29 23:28:33 -04:00
compaction.c mm: add & use zone_end_pfn() and zone_spans_pfn() 2013-02-23 17:50:20 -08:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages 2013-02-23 17:50:23 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c mm: move all mmu notifier invocations to be done outside the PT lock 2012-10-09 16:22:58 +09:00
filemap.c mm: only enforce stable page writes if the backing device requires it 2013-02-21 17:22:19 -08:00
fremap.c mm: introduce VM_POPULATE flag to better deal with racy userspace programs 2013-02-23 17:50:11 -08:00
frontswap.c frontswap: support exclusive gets if tmem backend is capable 2012-09-21 10:38:12 -04:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm: use NUMA_NO_NODE 2013-02-23 17:50:21 -08:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hugetlb.c mm: use long type for page counts in mm_populate() and get_user_pages() 2013-02-23 17:50:22 -08:00
hwpoison-inject.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: directly use __mlock_vma_pages_range() in find_extend_vma() 2013-02-23 17:50:11 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
Kconfig memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap 2013-02-23 17:50:12 -08:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c mm: add & use zone_end_pfn() and zone_spans_pfn() 2013-02-23 17:50:20 -08:00
ksm.c ksm: stop hotremove lockdep warning 2013-02-23 17:50:20 -08:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: make madvise(MADV_WILLNEED) support swap file prefetch 2013-02-23 17:50:10 -08:00
Makefile mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
memblock.c mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region(). 2013-02-23 17:50:14 -08:00
memcontrol.c memcg: stop warning on memcg_propagate_kmem 2013-02-23 17:50:22 -08:00
memory_hotplug.c mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same. 2013-02-23 17:50:21 -08:00
memory-failure.c HWPOISON: change order of error_states[]'s elements 2013-02-23 17:50:22 -08:00
memory.c mm: accelerate mm_populate() treatment of THP pages 2013-02-23 17:50:23 -08:00
mempolicy.c mm: use NUMA_NO_NODE 2013-02-23 17:50:21 -08:00
mempool.c mempool: add @gfp_mask to mempool_create_node() 2012-06-25 11:53:47 +02:00
migrate.c mm: remove offlining arg to migrate_pages 2013-02-23 17:50:19 -08:00
mincore.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
mlock.c mm: use long type for page counts in mm_populate() and get_user_pages() 2013-02-23 17:50:22 -08:00
mm_init.c mm: init: report on last-nid information stored in page->flags 2013-02-23 17:50:18 -08:00
mmap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
mmu_context.c mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() 2012-03-21 17:54:59 -07:00
mmu_notifier.c mm: export mmu notifier invalidates 2013-02-23 17:50:23 -08:00
mmzone.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
mprotect.c mm/mprotect.c: coding-style cleanups 2012-12-18 15:02:15 -08:00
mremap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
nommu.c mm: accelerate mm_populate() treatment of THP pages 2013-02-23 17:50:23 -08:00
oom_kill.c memcg, oom: provide more precise dump info while memcg oom happening 2013-02-23 17:50:08 -08:00
page_alloc.c mm: accurately document nr_free_*_pages functions with code comments 2013-02-23 17:50:22 -08:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c mm: add support for direct_IO to highmem pages 2012-07-31 18:42:47 -07:00
page_isolation.c mm: fix zone_watermark_ok_safe() accounting of isolated pages 2013-01-04 16:11:46 -08:00
page-writeback.c page-writeback.c: subtract min_free_kbytes from dirtyable memory 2013-02-23 17:50:17 -08:00
pagewalk.c thp: change split_huge_page_pmd() interface 2012-12-12 17:38:31 -08:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c mm, percpu: Make sure percpu_alloc early parameter has an argument 2012-12-02 06:23:04 -08:00
pgtable-generic.c mm: Only flush the TLB when clearing an accessible pte 2012-12-11 14:28:34 +00:00
process_vm_access.c aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() 2012-05-31 17:49:32 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
rmap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
shmem.c mm: shmem: use new radix tree iterator 2013-02-23 17:50:20 -08:00
slab_common.c slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slab.c memcg: add comments clarifying aspects of cache attribute propagation 2012-12-18 15:02:15 -08:00
slab.h slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slob.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
slub.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c memory-failure: use num_poisoned_pages instead of mce_bad_pages 2013-02-23 17:50:15 -08:00
swap_state.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
swap.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
swapfile.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
truncate.c mm: drop vmtruncate 2012-12-20 18:46:29 -05:00
util.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
vmalloc.c mm: use NUMA_NO_NODE 2013-02-23 17:50:21 -08:00
vmscan.c vmscan: change type of vm_total_pages to unsigned long 2013-02-23 17:50:22 -08:00
vmstat.c mm: add & use zone_end_pfn() and zone_spans_pfn() 2013-02-23 17:50:20 -08:00