linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-13 11:26:46 +07:00

Author	SHA1	Message	Date
WANG Cong	bbd0682596	mm/sparse.c: improve the error handling for sparse_add_one_section() Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST] Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
WANG Cong	af0cd5a7c3	mm/sparse.c: check the return value of sparse_index_alloc() Since sparse_index_alloc() can return NULL on memory allocation failure, we must deal with the failure condition when calling it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Geoff Levand	a5ee6daa52	sparsemem: make SPARSEMEM_VMEMMAP selectable SPARSEMEM_VMEMMAP needs to be a selectable config option to support building the kernel both with and without sparsemem vmemmap support. This selection is desirable for platforms which could be configured one way for platform specific builds and the other for multi-platform builds. Signed-off-by: Miguel Botón <mboton@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Adam Litke	72fad7139b	hugetlb: handle write-protection faults in follow_hugetlb_page The follow_hugetlb_page() fix I posted (merged as git commit `5b23dbe817`) missed one case. If the pte is present, but not writable and write access is requested by the caller to get_user_pages(), the code will do the wrong thing. Rather than calling hugetlb_fault to make the pte writable, it notes the presence of the pte and continues. This simple one-liner makes sure we also fault on the pte for this case. Please apply. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Dave Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-10 19:43:55 -08:00
Linus Torvalds	7fd272550b	Avoid double memclear() in SLOB/SLUB Both slob and slub react to __GFP_ZERO by clearing the allocation, which means that passing the GFP_ZERO bit down to the page allocator is just wasteful and pointless. Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-09 10:17:52 -08:00
Linus Torvalds	ad658cec23	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6: VM/Security: add security hook to do_brk Security: round mmap hint address above mmap_min_addr security: protect from stack expantion into low vm addresses Security: allow capable check to permit mmap or low vm space SELinux: detect dead booleans SELinux: do not clear f_op when removing entries	2007-12-05 09:26:52 -08:00
Eric Paris	ecaf18c15a	VM/Security: add security hook to do_brk Given a specifically crafted binary do_brk() can be used to get low pages available in userspace virtual memory and can thus be used to circumvent the mmap_min_addr low memory protection. Add security checks in do_brk(). Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:21 -08:00
Vegard Nossum	294a80a8ed	SLUB's ksize() fails for size > 2048 I can't pass memory allocated by kmalloc() to ksize() if it is allocated by SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2. The error of ksize() seems to be that it does not check if the allocation was made by SLUB or the page allocator. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:20 -08:00
Nick Piggin	369b8f5a70	mm: fix XIP file writes Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:20 -08:00
Tetsuo Handa	f8fcc93319	Add EXPORT_SYMBOL(ksize); mm/slub.c exports ksize(), but mm/slob.c and mm/slab.c don't. It's used by binfmt_flat, which can be built as a module. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:18 -08:00
Denis Cheng	4b01a0b161	mm/backing-dev.c: fix percpu_counter_destroy call bug in bdi_init this call should use the array index j, not i. But with this approach, just one int i is enough, int j is not needed. Signed-off-by: Denis Cheng <crquan@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:18 -08:00
Eric Paris	5a211a5dea	VM/Security: add security hook to do_brk Given a specifically crafted binary do_brk() can be used to get low pages available in userspace virtually memory and can thus be used to circumvent the mmap_min_addr low memory protection. Add security checks in do_brk(). Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:25:30 +11:00
Eric Paris	7cd94146cd	Security: round mmap hint address above mmap_min_addr If mmap_min_addr is set and a process attempts to mmap (not fixed) with a non-null hint address less than mmap_min_addr the mapping will fail the security checks. Since this is just a hint address this patch will round such a hint address above mmap_min_addr. gcj was found to try to be very frugal with vm usage and give hint addresses in the 8k-32k range. Without this patch all such programs failed and with the patch they happily get a higher address. This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist without it and there would be no security check possible no matter what. So we should not bother compiling in this rounding if it is just a waste of time. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:25:10 +11:00
Eric Paris	8869477a49	security: protect from stack expantion into low vm addresses Add security checks to make sure we are not attempting to expand the stack into memory protected by mmap_min_addr Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:24:48 +11:00
Matthew Wilcox	80cbd911ca	Fix kmem_cache_free performance regression in slab The database performance group have found that half the cycles spent in kmem_cache_free are spent in this one call to BUG_ON. Moving it into the CONFIG_SLAB_DEBUG-only function cache_free_debugcheck() is a performance win of almost 0.5% on their particular benchmark. The call was added as part of commit `ddc2e812d5` with the comment that "overhead should be minimal". It may have been minimal at the time, but it isn't now. [ Quoth Pekka Enberg: "I don't think the BUG_ON per se caused the performance regression but rather the virt_to_head_page() changes to virt_to_cache() that were added later." ] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-30 08:08:05 -08:00
KAMEZAWA Hiroyuki	e0dc3a53de	memory hotplug fix: fix section mismatch in vmammap_allock_block() Fixes section mismatch below. WARNING: vmlinux.o(.text+0x946b5): Section mismatch: reference to .init.text:' __alloc_bootmem_node (between 'vmemmap_alloc_block' and 'vmemmap_pgd_populate') Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-29 09:24:54 -08:00
Mel Gorman	ba72cb8cb0	Fix boot problem with iSeries lacking hugepage support Ordinarily the size of a pageblock is determined at compile-time based on the hugepage size. On PPC64, the hugepage size is determined at runtime based on what is supported by the machine. With legacy machines such as iSeries that do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order being set to -PAGE_SHIFT and a crash results shortly afterwards. This patch adds a function to select a sensible value for pageblock order by default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT is a sensible value before using the hugepage size; if it is not MAX_ORDER-1 is used. This is a fix for 2.6.24. Credit goes to Stephen Rothwell for identifying the bug and testing candidate patches. Additional credit goes to Andy Whitcroft for spotting a problem with respects to IA-64 before releasing. Additional credit to David Gibson for testing with the libhugetlbfs test suite. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-29 09:24:51 -08:00
Hugh Dickins	09f345da75	prep_zero_page: remove bogus BUG_ON 2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait" assertion. Presumably added under the misconception that clear_highpage uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-28 11:04:28 -08:00
Hugh Dickins	e84e2e132c	tmpfs: restore missing clear_highpage tmpfs was misconverted to __GFP_ZERO in 2.6.11. There's an unusual case in which shmem_getpage receives the page from its caller instead of allocating. We must cover this case by clear_highpage before SetPageUptodate, as before. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-28 11:04:28 -08:00
Christian Borntraeger	ce7e9fae8d	[S390] Optimize storage key handling for anonymous pages page_mkclean used to call page_clear_dirty for every given page. This is different to all other architectures, where the dirty bit in the PTEs is only resetted, if page_mapping() returns a non-NULL pointer. We can move the page_test_dirty/page_clear_dirty sequence into the 2nd if to avoid unnecessary iske/sske sequences, which are expensive. This change also helps kvm for s390 as the host must transfer the dirty bit into the guest status bits. By moving the page_clear_dirty operation into the 2nd if, the vm will only call page_clear_dirty for pages where it walks the mapping anyway. There it calls ptep_clear_flush for writable ptes, so we can transfer the dirty bit to the guest. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2007-11-20 11:13:46 +01:00
Linus Torvalds	8c0863403f	dirty page balancing: Get rid of broken unmapped_ratio logic This code harks back to the days when we didn't count dirty mapped pages, which led us to try to balance the number of dirty unmapped pages by how much unmapped memory there was in the system. That makes no sense any more, since now the dirty counts include the mapped pages. Not to mention that the math doesn't work with HIGHMEM machines anyway, and causes the unmapped_ratio to potentially turn negative (which we do catch thanks to clamping it at a minimum value, but I mention that as an indication of how broken the code is). The code also was written at a time when the default dirty ratio was much larger, and the unmapped_ratio logic effectively capped that large dirty ratio a bit. Again, we've since lowered the dirty ratio rather aggressively, further lessening the point of that code. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-15 16:41:52 -08:00
Nick Piggin	d32ddd8f20	slob: fix memory corruption Previously, it would be possible for prev->next to point to &free_slob_pages, and thus we would try to move a list onto itself, and bad things would happen. It seems a bit hairy to be doing list operations with the list marker as an entry, rather than a head, but... this resolves the following crash: http://bugzilla.kernel.org/show_bug.cgi?id=9379 Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-15 08:36:27 -08:00
Balbir Singh	20a1022d4a	Swap delay accounting, include lock_page() delays The delay incurred in lock_page() should also be accounted in swap delay accounting Reported-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:44 -08:00
Randy Dunlap	42614fcde7	vmstat: fix section mismatch warning Mark start_cpu_timer() as __cpuinit instead of __devinit. Fixes this section warning: WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show') Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:42 -08:00
Adrian Bunk	be21f0ab0d	fix mm/util.c:krealloc() Commit `ef8b4520bd` added one NULL check for "p" in krealloc(), but that doesn't seem to be enough since there doesn't seem to be any guarantee that memcpy(ret, NULL, 0) works (spotted by the Coverity checker). For making it clearer what happens this patch also removes the pointless min(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:41 -08:00
Ken Chen	45c682a68a	hugetlb: fix i_blocks accounting For administrative purpose, we want to query actual block usage for hugetlbfs file via fstat. Currently, hugetlbfs always return 0. Fix that up since kernel already has all the information to track it properly. Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adrian Bunk	8cde045c7e	mm/hugetlb.c: make a function static return_unused_surplus_pages() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	90d8b7e612	hugetlb: enforce quotas during reservation for shared mappings When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved to guarantee no problems will occur later when instantiating pages. If quotas are in force, page instantiation could fail due to a race with another process or an oversized (but approved) shared mapping. To prevent these scenarios, debit the quota for the full reservation amount up front and credit the unused quota when the reservation is released. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	9a119c056d	hugetlb: allow bulk updating in hugetlb_*_quota() Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to allow bulk updating of the sbinfo->free_blocks counter. This will be used by the next patch in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	2fc39cec6a	hugetlb: debit quota in alloc_huge_page Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota() seem out of place. The alloc/free API is unbalanced because we handle the hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota(). Move the get inside alloc_huge_page to clean up this disparity. This patch has been kept apart from the previous patch because of the somewhat dodgy ERR_PTR() use herein. Moving the quota logic means that alloc_huge_page() has two failure modes. Quota failure must result in a SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR() doesn't like the small positive errnos we have in VM_FAULT_* so they must be negated before they are used. Does anyone take issue with the way I am using PTR_ERR. If so, what are your thoughts on how to clean this up (without needing an if,else if,else block at each alloc_huge_page() callsite)? Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	c79fb75e5a	hugetlb: fix quota management for private mappings The hugetlbfs quota management system was never taught to handle MAP_PRIVATE mappings when that support was added. Currently, quota is debited at page instantiation and credited at file truncation. This approach works correctly for shared pages but is incomplete for private pages. In addition to hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but this function does not respect quotas. Private huge pages are treated very much like normal, anonymous pages. They are not "backed" by the hugetlbfs file and are not stored in the mapping's radix tree. This means that private pages are invisible to truncate_hugepages() so that function will not credit the quota. This patch (based on a prototype provided by Ken Chen) moves quota crediting for all pages into free_huge_page(). page->private is used to store a pointer to the mapping to which this page belongs. This is used to credit quota on the appropriate hugetlbfs instance. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	348ea204cc	hugetlb: split alloc_huge_page into private and shared components Hugetlbfs implements a quota system which can limit the amount of memory that can be used by the filesystem. Before allocating a new huge page for a file, the quota is checked and debited. The quota is then credited when truncating the file. I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED mappings. Before detailing the problems and my proposed solutions, we should agree on a definition of quotas that properly addresses both private and shared pages. Since the purpose of quotas is to limit total memory consumption on a per-filesystem basis, I argue that all pages allocated by the fs (private and shared) should be charged against quota. Private Mappings ================ The current code will debit quota for private pages sometimes, but will never credit it. At a minimum, this causes a leak in the quota accounting which renders the accounting essentially useless as it is. Shared pages have a one to one mapping with a hugetlbfs file and are easy to account by debiting on allocation and crediting on truncate. Private pages are anonymous in nature and have a many to one relationship with their hugetlbfs files (due to copy on write). Because private pages are not indexed by the mapping's radix tree, thier quota cannot be credited at file truncation time. Crediting must be done when the page is unmapped and freed. Shared Pages ============ I discovered an issue concerning the interaction between the MAP_SHARED reservation system and quotas. Since quota is not checked until page instantiation, an over-quota mmap/reservation will initially succeed. When instantiating the first over-quota page, the program will receive SIGBUS. This is inconsistent since the reservation is supposed to be a guarantee. The solution is to debit the full amount of quota at reservation time and credit the unused portion when the reservation is released. This patch series brings quotas back in line by making the following modifications: * Private pages - Debit quota in alloc_huge_page() - Credit quota in free_huge_page() * Shared pages - Debit quota for entire reservation at mmap time - Credit quota for instantiated pages in free_huge_page() - Credit quota for unused reservation at munmap time This patch: The shared page reservation and dynamic pool resizing features have made the allocation of private vs. shared huge pages quite different. By splitting out the private/shared-specific portions of the process into their own functions, readability is greatly improved. alloc_huge_page now calls the proper helper and performs common operations. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Adam Litke	5b23dbe817	hugetlb: follow_hugetlb_page() for write access When calling get_user_pages(), a write flag is passed in by the caller to indicate if write access is required on the faulted-in pages. Currently, follow_hugetlb_page() ignores this flag and always faults pages for read-only access. This can cause data corruption because a device driver that calls get_user_pages() with write set will not expect COW faults to occur on the returned pages. This patch passes the write flag down to follow_hugetlb_page() and makes sure hugetlb_fault() is called with the right write_access parameter. [ezk@cs.sunysb.edu: build fix] Signed-off-by: Adam Litke <agl@us.ibm.com> Reviewed-by: Ken Chen <kenchen@google.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Yasunori Goto	887c3cb188	Add IORESOUCE_BUSY flag for System RAM i386 and x86-64 registers System RAM as IORESOURCE_MEM \| IORESOURCE_BUSY. But ia64 registers it as IORESOURCE_MEM only. In addition, memory hotplug code registers new memory as IORESOURCE_MEM too. This difference causes a failure of memory unplug of x86-64. This patch fixes it. This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by PCI device. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Luck, Tony" <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Peter Zijlstra	5fce25a9df	mm: speed up writeback ramp-up on clean systems We allow violation of bdi limits if there is a lot of room on the system. Once we hit half the total limit we start enforcing bdi limits and bdi ramp-up should happen. Doing it this way avoids many small writeouts on an otherwise idle system and should also speed up the ramp-up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
KAMEZAWA Hiroyuki	dbc0e4cefd	memory hotremove: unset migrate type "ISOLATE" after removal We should unset migrate type "ISOLATE" when we successfully removed memory. But current code has BUG and cannot works well. This patch also includes bugfix? to change get_pageblock_flags to get_pageblock_migratetype(). Thanks to Badari Pulavarty for finding this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
Lee Schermerhorn	3ad33b2436	Migration: find correct vma in new_vma_page() We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For anon-regions, we just fail to migrate any pages beyond the 1st vma in the range. This occurs because do_mbind() collects a list of pages to migrate by calling check_range(). check_range() walks the task's mm, spanning vmas as necessary, to collect the migratable pages into a list. Then, do_mbind() calls migrate_pages() passing the list of pages, a function to allocate new pages based on vma policy [new_vma_page()], and a pointer to the first vma of the range. For each page in the list, new_vma_page() calls page_address_in_vma() passing the page and the vma [first in range] to obtain the address to get for alloc_page_vma(). The page address is needed to get interleaving policy correct. If the pages in the list come from multiple vmas, eventually, new_page_address() will pass that page to page_address_in_vma() with the incorrect vma. For !PageAnon pages, this will result in a bug check in rmap.c:vma_address(). For anon pages, vma_address() will just return EFAULT and fail the migration. This patch modifies new_vma_page() to check the return value from page_address_in_vma(). If the return value is EFAULT, new_vma_page() searchs forward via vm_next for the vma that maps the page--i.e., that does not return EFAULT. This assumes that the pages in the list handed to migrate_pages() is in address order. This is currently case. The patch documents this assumption in a new comment block for new_vma_page(). If new_vma_page() cannot locate the vma mapping the page in a forward search in the mm, it will pass a NULL vma to alloc_page_vma(). This will result in the allocation using the task policy, if any, else system default policy. This situation is unlikely, but the patch documents this behavior with a comment. Note, this patch results in restarting from the first vma in a multi-vma range each time new_vma_page() is called. If this is not acceptable, we can make the vma argument a pointer, both in new_vma_page() and it's caller unmap_and_move() so that the value held by the loop in migrate_pages() always passes down the last vma in which a page was found. This will require changes to all new_page_t functions passed to migrate_pages(). Is this necessary? For this patch to work, we can't bug check in vma_address() for pages outside the argument vma. This patch removes the BUG_ON(). All other callers [besides new_vma_page()] already check the return status. Tested on x86_64, 4 node NUMA platform. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
Akinobu Mita	cc550defe9	slab: fix typo in allocation failure handling This patch fixes wrong array index in allocation failure handling. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:36 -08:00
Linus Torvalds	44048d700b	Revert "Bias the placement of kernel pages at lower PFNs" This reverts commit `5adc5be7cd`. Alexey Dobriyan reports that it causes huge slowdowns under some loads, in his case a "mkfs.ext2" on a 30G partition. With the placement bias, the mkfs took over four minutes, with it reverted it's back to about ten seconds for Alexey. Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-12 14:14:44 -08:00
Denis Cheng	efe44183f6	SLUB: killed the unused "end" variable Since the macro "for_each_object" introduced, the "end" variable becomes unused anymore. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-12 10:32:29 -08:00
Linus Torvalds	221d46841b	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: lguest: tidy up documentation kernel/futex.c: make 3 functions static unexport access_process_vm lguest: make async_hcall() static	2007-11-05 11:39:00 -08:00
Christoph Lameter	05aa345034	SLUB: Fix memory leak by not reusing cpu_slab Fix the memory leak that may occur when we attempt to reuse a cpu_slab that was allocated while we reenabled interrupts in order to be able to grow a slab cache. The per cpu freelist may contain objects and in that situation we may overwrite the per cpu freelist pointer loosing objects. This only occurs if we find that the concurrently allocated slab fits our allocation needs. If we simply always deactivate the slab then the freelist will be properly reintegrated and the memory leak will go away. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-05 11:37:12 -08:00
Adrian Bunk	02c3530da6	unexport access_process_vm This patch removes the no longer used EXPORT_SYMBOL_GPL(access_process_vm). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2007-11-05 21:53:39 +11:00
Linus Torvalds	5307cc1aa5	Remove broken ptrace() special-case code from file mapping The kernel has for random historical reasons allowed ptrace() accesses to access (and insert) pages into the page cache above the size of the file. However, Nick broke that by mistake when doing the new fault handling in commit `54cb8821de` ("mm: merge populate and nopage into fault (fixes nonlinear)". The breakage caused a hang with gdb when trying to access the invalid page. The ptrace "feature" really isn't worth resurrecting, since it really is wrong both from a portability _and_ from an internal page cache validity standpoint. So this removes those old broken remnants, and fixes the ptrace() hang in the process. Noticed and bisected by Duane Griffin, who also supplied a test-case (quoth Nick: "Well that's probably the best bug report I've ever had, thanks Duane!"). Cc: Duane Griffin <duaneg@dghda.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-31 09:19:46 -07:00
Zach Brown	bdb76ef5a4	dio: fix cache invalidation after sync writes Commit commit `65b8291c40` ("dio: invalidate clean pages before dio write") introduced a bug which stopped dio from ever invalidating the page cache after writes. It still invalidated it before writes so most users were fine. Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting this bug when he had a buffered reader immediately reading file data after an O_DIRECT wirter had written the data. The kernel issued read-ahead beyond the position of the reader which overlapped with the O_DIRECT writer. The failure to invalidate after writes caused the reader to see stale data from the read-ahead. The following patch is originally from Karl. The following commentary is his: The below 3rd try takes on your suggestion of just invalidating no matter what the retval from the direct_IO call. I ran it thru the test-case several times and it has worked every time. The post-invalidate is probably still too early for async-directio, but I don't have a testcase for that; just sync. And, this won't be any worse in the async case. I added a test to the aio-dio-regress repository which mimics Karl's IO pattern. It verifed the bad behaviour and that the patch fixed it. I agree with Karl, this still doesn't help the case where a buffered reader follows an AIO O_DIRECT writer. That will require a bit more work. This gives up on the idea of returning EIO to indicate to userspace that stale data remains if the invalidation failed. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Karl Schendel <kschendel@datallegro.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 12:14:06 -07:00
Hugh Dickins	487e9bf25c	fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE It's possible to provoke unionfs (not yet in mainline, though in mm and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect it's possible to provoke the 2.6.23 ecryptfs in the same way (but the 2.6.24 ecryptfs no longer calls lower level's ->writepage). This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could leak from tmpfs via write_cache_pages and unionfs to userspace. There's already a fix (`e423003028` - writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so far as it goes; but insufficient because it doesn't address the underlying issue, that shmem_writepage expects to be called only by vmscan (relying on backing_dev_info capabilities to prevent the normal writeback path from ever approaching it). That's an increasingly fragile assumption, and ramdisk_writepage (the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check wbc->for_reclaim before returning it. Make the same check in shmem_writepage, thereby sidestepping the page_mapped BUG also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Erez Zadok <ezk@cs.sunysb.edu> Cc: <stable@kernel.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 08:06:55 -07:00
Glauber de Oliveira Costa	8bca44bbd3	mm/sparse-vmemmap.c: make sure init_mm is included mm/sparse-vmemmap.c uses init_mm in some places. However, it is not present in any of the headers currently included in the file. init_mm is defined as extern in sched.h, so we add it to the headers list Up to now, this problem was masked by the fact that functions like set_pte_at() and pmd_populate_kernel() are usually macros that expand to simpler variants that does not use the first parameter at all. Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 08:06:55 -07:00
Linus Torvalds	6a22c57b8d	Revert "x86_64: allocate sparsemem memmap above 4G" This reverts commit `2e1c49db4c`. First off, testing in Fedora has shown it to cause boot failures, bisected down by Martin Ebourne, and reported by Dave Jobes. So the commit will likely be reverted in the 2.6.23 stable kernels. Secondly, in the 2.6.24 model, x86-64 has now grown support for SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the bug is not visible any more, it's become invisible due to the code just being irrelevant and no longer enabled on the only architecture that this ever affected. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Martin Ebourne <fedora@ebourne.me.uk> Cc: Zou Nan hai <nanhai.zou@intel.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 14:05:37 -07:00
David Howells	f2b8544f5f	NOMMU: mm/nommu.c needs linux/module.h mm/nommu.c needs to #include linux/module.h for it to understand EXPORT_*() macros. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 07:53:26 -07:00
Linus Torvalds	cbf67812b2	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: compat_ioctl: fix block device compat ioctl regression [BLOCK] Fix bad sharing of tag busy list on queues with shared tag maps Fix a build error when BLOCK=n block: use lock bitops for the tag map. cciss: update copyright notices cfq_get_queue: fix possible NULL pointer access blk_sync_queue() should cancel request_queue->unplug_work cfq_exit_queue() should cancel cfq_data->unplug_work block layer: remove a unused argument of drive_stat_acct()	2007-10-29 07:49:28 -07:00

1 2 3 4 5 ...

1758 Commits